Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use native prepared statements from trino-python-client #61

Merged
merged 2 commits into from
May 9, 2022
Merged

Use native prepared statements from trino-python-client #61

merged 2 commits into from
May 9, 2022

Conversation

mdesmet
Copy link
Member

@mdesmet mdesmet commented Apr 15, 2022

Overview

Remove prepared statements emulation from dbt-trino and rely on the prepared statements support of native python types from trino-python-client

Checklist

  • This PR includes tests, or tests are not required/relevant for this PR
  • README.md updated and added information about my change
  • CHANGELOG.md updated and added information about my change

@@ -64,7 +62,7 @@ def __init__(self, handle):
self._fetch_result = None

def cursor(self):
self._cursor = self.handle.cursor()
self._cursor = self.handle.cursor(experimental_python_types=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should give the ability to the user to opt in/out of the "experimental" feature.
Should we introduce a profile configuration setting?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO the conversions in ConnectionWrapper._escape_value(cls, value) are all implemented by the trino-python-client, which is the right place to implement them. The covered data types are not really that 'experimental', in the sense that the same Python data type constraints apply as in the earlier code. So I consider this some kind of technical debt that we should remove.

Not setting this parameter consistently would also prevent us from consistently fixing #28.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This going to be an impactful change for the users, but I agree that we need to introduce it since some Python types were mapped correctly "by accident".

Perhaps enable experimental_python_types by default and introduce a profile configuration setting in case someone would like to use the old approach (to not impact current pipelines and migrate them later on). But I doubt it is a common situation.

Could you please point out the code in trino-python-client which converts for instance:

if value is None:
    return "NULL"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the code that handles the prepared statements parameters

https://github.com/trinodb/trino-python-client/blob/e21b06d2a2c4e6a2b95fbab96d9168ae15b28cc9/trino/dbapi.py#L340-L408

As you can see it handles following cases:

None -> NULL (same)
string -> quoted (same)
pyhon datetime -> trino TIMESTAMP (agate doesn't do anything with timezones, also currently the python-trino-client only supports up to millisecond precision, not completely sure why we strip out the precision within dbt-trino) (same)
python date -> trino DATE (same)
NUMBER -> specific handling for integer, decimal (slightly different)

The number handling seems slightly different but without any impact. Basically every number becomes a Decimal through the agate type detection. See agate docs.

image

Trino can handle conversions from decimal to other numeric types. Any loss in precision is the result of conversion to the chosen target column type.

CREATE TABLE test_delta.test.test_table1 (a integer);
insert into test_delta.test.test_table1 (a) values (decimal '1.2')
select * from test_delta.test.test_table1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add fallback logic to the "classic" binding manner and eventually remove it (add an extra issue for follow up).

Even if the changes are mostly about dbt seed we should however ensure that the users have the option to fallback to the classic functionality with the newer version of dbt-trino in case they will deal with unforeseen exceptions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I thought a little more about it. Actually I think it should be fixed in trino-python-client by not creating lines bigger http.client._MAXLINE. Then no configuration should be needed.

Still need to validate this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree, let's set up it in trino-python-client.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end I decided to put it in dbt-trino as patching Python's http.client may impact other usages, while in the context of dbt command executions this impact is limited. Currently we patch the http.client by default. It can be disabled by setting patch_http_client_header_limit to false in your dbt profile. This is also documented in the README.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played a bit with a 15K records CSV file

Note that dbt seed is more thought for test purposes.
Is this a problem that we should be solving?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use cases of dbt seed are definitely much broader then test purposes only.

The issue is that some users may have some seed files that may fail if we remove the http client patch. It won't be obvious for them to find out what they need to do. That was my motivation to add this as a default.

Copy link
Contributor

@hovaesco hovaesco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@@ -173,3 +173,7 @@
{% macro trino__current_timestamp() -%}
CURRENT_TIMESTAMP
{%- endmacro %}

{% macro trino__get_binding_char() %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why trino__get_binding_char() is needed and where is used? Is default macro from dbt-core not working with trino?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So before we converted bindings using standard Python string interpolation:

bindings = tuple(self._escape_value(b) for b in bindings)
sql = sql % bindings

Now we are using Trino prepared statements. As we can then depend on the type support added in trino-python-client, which covers more data types and a better unit test coverage.

To make the prepared statements work, the standard %s dbt binding char need to be replaced with the Trino binding char ?, similar like many other dbt adapters, and we pass the bindings down to the execute function of Trino's dbapi.

result = self._cursor.execute(sql, params=bindings)

@@ -64,7 +62,7 @@ def __init__(self, handle):
self._fetch_result = None

def cursor(self):
self._cursor = self.handle.cursor()
self._cursor = self.handle.cursor(experimental_python_types=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This going to be an impactful change for the users, but I agree that we need to introduce it since some Python types were mapped correctly "by accident".

Perhaps enable experimental_python_types by default and introduce a profile configuration setting in case someone would like to use the old approach (to not impact current pipelines and migrate them later on). But I doubt it is a common situation.

Could you please point out the code in trino-python-client which converts for instance:

if value is None:
    return "NULL"

@findinpath
Copy link
Collaborator

nit: Please use a capital letter in the first word of the commit message:

  • upgrade trino-python-client -> Upgrade trino-python-client
  • align pytz with trino-python-client -> Align pytz with trino-python-client

@@ -64,7 +62,7 @@ def __init__(self, handle):
self._fetch_result = None

def cursor(self):
self._cursor = self.handle.cursor()
self._cursor = self.handle.cursor(experimental_python_types=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the main functionalities of dbt-trino affected by this change? Is it only dbt seed ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would also return python types instead of strings. So there might be more places then only seeds, #28 for example.

Also any dbt packages expecting eg a python datetime or a number instead a string to do some calculations.

Note that most of the other dbapi return proper Python data types and therefore most of the libraries took that as an assumption.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested if the PR solves #28 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, but will do by next week.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Source freshness has been succesfully tested

@mdesmet
Copy link
Member Author

mdesmet commented Apr 28, 2022

I've also tested the source freshness feature

❯ dbt source freshness
07:56:04  Running with dbt=1.0.3
07:56:04  Unable to do partial parsing because profile has changed
07:56:05  Found 19 models, 14 tests, 0 snapshots, 0 analyses, 375 macros, 0 operations, 5 seed files, 13 sources, 0 exposures, 0 metrics
07:56:05  
07:56:05  Concurrency: 8 threads (target='trino')
07:56:05  
07:56:05  1 of 1 START freshness of replica_dwh_screenings.rd_dag......................... [RUN]
07:56:06  1 of 1 PASS freshness of replica_dwh_screenings.rd_dag.......................... [PASS in 0.85s]
07:56:06  Done.
❯ dbt source freshness
07:56:27  Running with dbt=1.0.3
07:56:27  Found 19 models, 14 tests, 0 snapshots, 0 analyses, 375 macros, 0 operations, 5 seed files, 13 sources, 0 exposures, 0 metrics
07:56:27  
07:56:27  Concurrency: 8 threads (target='trino')
07:56:27  
07:56:27  1 of 1 START freshness of replica_dwh_screenings.rd_dag......................... [RUN]
07:56:28  1 of 1 ERROR STALE freshness of replica_dwh_screenings.rd_dag................... [ERROR STALE in 0.57s]
07:56:28  

README.md Outdated
| host | The hostname to connect to | Required | `127.0.0.1` |
| port | The port to connect to the host on | Required | `8080` |
| threads | How many threads dbt should use | Optional (default is `1`) | `8` |
| patch_http_client_header_limit | [Patch python's http client to work around LineTooLong limit](#prepared-statements) | Optional (default is `true`) | `true` or `false` |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

patch_http_client_header_limit please make sure that the tabs are aligned.

headers.append(line)
if line in (b'\r\n', b'\n', b''):
break
header_string = b''.join(headers).decode('iso-8859-1')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why Latin-1 encoding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the encoding according the http spec.

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

@findinpath
Copy link
Collaborator

@mdesmet please consider shrinking this PR / spin-off part of it in another PR to enable the source freshness functionality and leave the controversial discussion around the line too long for dbt seed aside.

@mdesmet
Copy link
Member Author

mdesmet commented May 4, 2022

@mdesmet please consider shrinking this PR / spin-off part of it in another PR to enable the source freshness functionality and leave the controversial discussion around the line too long for dbt seed aside.

As discussed I've added a new flag, also documented in the README.

if TrinoCredentials._HTTP_CLIENT_PATCHED:
return
if (self.patch_http_client_header_limit):
TrinoCredentials.patch_http_client()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use TrinoCredentials and not self. ?

def parse_headers(fp, _class=HTTPMessage):
headers = []
while True:
line = fp.readline(_MAXLINE + 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this code snippet taken from?
I'm not keen into adding this workaround in the current form in the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been taken from Stack overflow.

I also had my doubts about the encoding and investigated that at that time and definitely don't like patching standard libraries as they also may impact any other code executed after this patch. So I tend to agree with you, however dbt code is in general executed through a dbt command, so this patch will probably not impact any other applications outside of very exotic usage of dbt as a library.

@@ -64,7 +62,7 @@ def __init__(self, handle):
self._fetch_result = None

def cursor(self):
self._cursor = self.handle.cursor()
self._cursor = self.handle.cursor(experimental_python_types=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played a bit with a 15K records CSV file

Note that dbt seed is more thought for test purposes.
Is this a problem that we should be solving?

@@ -21,6 +21,8 @@


logger = AdapterLogger("Trino")
PATCH_HTTP_CLIENT_DEFAULT = True
DISABLE_PREPARED_STATEMENTS_DEFAULT = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about flipping around the name -> PREPARED_STATEMENTS_ENABLED ?

_ALIASES = {"catalog": "database"}
_HTTP_CLIENT_PATCHED = False
Copy link
Collaborator

@findinpath findinpath May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http client patching is not in any way critical for this PR.

Can we get this change in a different PR to concentrate more thoroughly on it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we added the flag for the prepared statements, that's true. It may however slightly increase the odds of the defaults failing.

I had discussed this issue earlier with @hovaesco, in the end I think Trino should play nicely with the Python ecosystem, so Trino should abide to any limits of Python (such as http.client._LINE). So ideally we get this fixed in Trino itself.

It's just that compared to Python scripting, the interface of dbt is not Python but the dbt command, so it's not easy for users to add this Python snippet and override the Python environment. So I thought it would be nice to ensure smooth operations of any existing large seed files and/or bumped up batch size that worked before but not anymore with the prepared statements.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against adding this kind of patch.
I would rather discuss it more thoroughly (in a separate PR) before actually making it generally available.

@mdesmet
Copy link
Member Author

mdesmet commented May 5, 2022

I made use of the new test framework to add some integration tests. I also added a trino_connection fixture to be able to directly query data on Trino if necessary. Used here to detect if the dbt seed execution used prepared statements or not.

# trino doesn't actually pass bindings along so we have to do the
# escaping and formatting ourselves
if not self._prepared_statements_enabled and bindings is not None:
# DEPRECATED: by default prepared statements are used.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do create a follow-up request to remove this safety feature after 1,2 releases.

Copy link
Collaborator

@findinpath findinpath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments

{{ return('?') }}
{%- else -%}
{{ return('%s') }}
{%- endif -%}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  {%- if target.prepared_statements_enabled|as_bool -%}
      {{ return('?') }}
  {%- else -%}
      {{ return('%s') }}
  {%- endif -%}

@@ -1,4 +1,5 @@
import pytest
import trino

# Import the fuctional fixtures as a plugin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls add a Fix typo commit for changing "fuctional" to "functional"


# The actual sequence of dbt commands and assertions
# pytest will take care of all "setup" + "teardown"
def test_run_seed_with_prepared_statements_disabled(self, project, trino_connection):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both methods:

  • test_run_seed_with_prepared_statements_disabled
  • test_run_seed_with_prepared_statements_enabled

look roughly the same. Extract the common logic to the base class.

Copy link
Contributor

@hovaesco hovaesco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only one small comment.

tests/conftest.py Show resolved Hide resolved
@findinpath findinpath self-requested a review May 9, 2022 10:25
@hovaesco hovaesco merged commit 70e4449 into starburstdata:master May 9, 2022
@mdesmet mdesmet deleted the feature/types branch August 8, 2022 15:15
EminUZUN pushed a commit to EminUZUN/dbt-trino that referenced this pull request Feb 14, 2023
…659d827fb85fb65601b36a

[Snyk] Security upgrade openjdk from 8-jre to 16-ea-17
EminUZUN pushed a commit to EminUZUN/dbt-trino that referenced this pull request Feb 14, 2023
Use native prepared statements from trino-python-client
damian3031 pushed a commit to damian3031/dbt-trino that referenced this pull request Sep 9, 2024
Use native prepared statements from trino-python-client
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants