Use native prepared statements from trino-python-client #61

mdesmet · 2022-04-15T19:59:51Z

Overview

Remove the prepared statement emulation from dbt-trino and rely on trino-python-client's prepared statement support with native Python types.

Description:
Related issue(s):
Related code pull request(s):
Related link(s): Improved python types from trino-python-client, Source Freshness is not implemented #28

Checklist

This PR includes tests, or tests are not required/relevant for this PR
README.md updated and added information about my change
CHANGELOG.md updated and added information about my change

findinpath · 2022-04-16T03:33:56Z

dbt/adapters/trino/connections.py

@@ -64,7 +62,7 @@ def __init__(self, handle):
        self._fetch_result = None

    def cursor(self):
-        self._cursor = self.handle.cursor()
+        self._cursor = self.handle.cursor(experimental_python_types=True)


We should give the ability to the user to opt in/out of the "experimental" feature.
Should we introduce a profile configuration setting?

IMHO the conversions in ConnectionWrapper._escape_value(cls, value) are all implemented by the trino-python-client, which is the right place to implement them. The covered data types are not really that 'experimental', in the sense that the same Python data type constraints apply as in the earlier code. So I consider this some kind of technical debt that we should remove.

Not setting this parameter consistently would also prevent us from consistently fixing #28.

This is going to be an impactful change for the users, but I agree that we need to introduce it since some Python types were mapped correctly "by accident".

Perhaps enable experimental_python_types by default and introduce a profile configuration setting in case someone would like to use the old approach (to not impact current pipelines and migrate them later on). But I doubt it is a common situation.

Could you please point out the code in trino-python-client which converts for instance:

if value is None: return "NULL"

Here is the code that handles the prepared statements parameters

https://github.com/trinodb/trino-python-client/blob/e21b06d2a2c4e6a2b95fbab96d9168ae15b28cc9/trino/dbapi.py#L340-L408

As you can see it handles following cases:

None -> NULL (same)
string -> quoted (same)
Python datetime -> Trino TIMESTAMP (agate doesn't do anything with timezones; also, trino-python-client currently only supports up to millisecond precision, and I'm not completely sure why we strip out the precision within dbt-trino) (same)
python date -> trino DATE (same)
NUMBER -> specific handling for integer, decimal (slightly different)
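
The mapping listed above can be pictured with a small sketch. This is not the trino-python-client code (the linked dbapi.py is the reference); `to_sql_literal` is a hypothetical helper, and the exact literal formats are assumptions made for illustration only.

```python
from datetime import date, datetime
from decimal import Decimal


def to_sql_literal(value):
    """Hypothetical helper: render a Python value as a Trino-style SQL literal."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        # bool is a subclass of int, so it has to be checked before int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, Decimal):
        return f"DECIMAL '{value}'"
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, datetime):
        # datetime is a subclass of date, so it is checked first;
        # the microseconds are cut down to millisecond precision
        return "TIMESTAMP '" + value.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] + "'"
    if isinstance(value, date):
        return f"DATE '{value.isoformat()}'"
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Doing this per type in the driver, rather than in each adapter, is the argument made above for dropping `_escape_value` from dbt-trino.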

The number handling seems slightly different but without any impact. Basically every number becomes a Decimal through the agate type detection. See agate docs.

Trino can handle conversions from decimal to other numeric types. Any loss in precision is the result of conversion to the chosen target column type.

CREATE TABLE test_delta.test.test_table1 (a integer);
INSERT INTO test_delta.test.test_table1 (a) VALUES (decimal '1.2');
SELECT * FROM test_delta.test.test_table1;

Let's add fallback logic to the "classic" binding manner and eventually remove it (add an extra issue for follow up).

Even if the changes are mostly about dbt seed, we should still ensure that users can fall back to the classic functionality with the newer version of dbt-trino in case they run into unforeseen exceptions.

Well, I thought a little more about it. Actually I think it should be fixed in trino-python-client by not creating lines bigger than http.client._MAXLINE. Then no configuration should be needed.

Still need to validate this.

Yes, I agree, let's set it up in trino-python-client.

In the end I decided to put it in dbt-trino as patching Python's http.client may impact other usages, while in the context of dbt command executions this impact is limited. Currently we patch the http.client by default. It can be disabled by setting patch_http_client_header_limit to false in your dbt profile. This is also documented in the README.
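
For context, the limit in question can be reproduced with the standard library alone. The following is a minimal sketch, not the patch from this PR: `header_parses` is a hypothetical helper, and `_MAXLINE` and `parse_headers` are CPython internals that may change between versions.

```python
import http.client
import io

# _MAXLINE is a private CPython constant; relying on it is an assumption of this sketch.
limit = http.client._MAXLINE


def header_parses(value_length):
    """Return True if a header whose value has the given length can be parsed."""
    raw = b"X-Example: " + b"a" * value_length + b"\r\n\r\n"
    try:
        http.client.parse_headers(io.BytesIO(raw))
        return True
    except http.client.LineTooLong:
        return False


short_ok = header_parses(100)        # well under the limit
long_ok = header_parses(limit + 1)   # a single header line over the limit
```

Any header line longer than the limit is rejected by the unpatched client, which is why large seed batches can fail once the statement travels in HTTP headers.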

I played a bit with a 15K records CSV file

Note that dbt seed is intended mostly for test purposes.
Is this a problem that we should be solving?

I think the use cases of dbt seed are definitely much broader than test purposes only.

The issue is that some users may have some seed files that may fail if we remove the http client patch. It won't be obvious for them to find out what they need to do. That was my motivation to add this as a default.

hovaesco

Looks good

hovaesco · 2022-04-16T17:49:41Z

dbt/include/trino/macros/adapters.sql

@@ -173,3 +173,7 @@
 {% macro trino__current_timestamp() -%}
    CURRENT_TIMESTAMP
 {%- endmacro %}
+
+{% macro trino__get_binding_char() %}


Why is trino__get_binding_char() needed and where is it used? Is the default macro from dbt-core not working with Trino?

So before we converted bindings using standard Python string interpolation:

dbt-trino/dbt/adapters/trino/connections.py

Lines 103 to 104 in 338475a

bindings = tuple(self._escape_value(b) for b in bindings)

sql = sql % bindings

Now we are using Trino prepared statements, so we can depend on the type support added in trino-python-client, which covers more data types and has better unit test coverage.

To make the prepared statements work, the standard dbt binding char %s needs to be replaced with the Trino binding char ?, as in many other dbt adapters, and we pass the bindings down to the execute function of Trino's dbapi.

result = self._cursor.execute(sql, params=bindings)
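
The difference between the two binding styles can be tried out without a Trino cluster. This is a rough sketch under stated assumptions: sqlite3 stands in for Trino only because it ships with Python and also uses `?` placeholders, and the `str.replace` call is a simplification of what the binding-char macro achieves at render time.

```python
import sqlite3

# sqlite3 is a stand-in driver; table and column names are made up for this example.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE seed (id INTEGER, name TEXT)")

bindings = (1, "O'Brien")

# dbt's default binding char is %s, which suited the old string interpolation.
dbt_sql = "INSERT INTO seed (id, name) VALUES (%s, %s)"

# With a qmark-style driver the placeholder becomes ? and the raw values are
# handed to execute(), so quoting and type handling are left to the driver.
qmark_sql = dbt_sql.replace("%s", "?")
cur.execute(qmark_sql, bindings)

rows = cur.execute("SELECT id, name FROM seed").fetchall()
conn.close()
```

Note that the embedded single quote needs no manual escaping in the bound version, which is the job `_escape_value` used to do by hand.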

hovaesco · 2022-04-16T18:03:09Z

dbt/adapters/trino/connections.py

@@ -64,7 +62,7 @@ def __init__(self, handle):
        self._fetch_result = None

    def cursor(self):
-        self._cursor = self.handle.cursor()
+        self._cursor = self.handle.cursor(experimental_python_types=True)


This is going to be an impactful change for the users, but I agree that we need to introduce it since some Python types were mapped correctly "by accident".

Perhaps enable experimental_python_types by default and introduce a profile configuration setting in case someone would like to use the old approach (to not impact current pipelines and migrate them later on). But I doubt it is a common situation.

Could you please point out the code in trino-python-client which converts for instance:

if value is None: return "NULL"

findinpath · 2022-04-19T04:43:43Z

nit: Please use a capital letter in the first word of the commit message:

upgrade trino-python-client -> Upgrade trino-python-client
align pytz with trino-python-client -> Align pytz with trino-python-client

findinpath · 2022-04-19T04:57:44Z

dbt/adapters/trino/connections.py

@@ -64,7 +62,7 @@ def __init__(self, handle):
        self._fetch_result = None

    def cursor(self):
-        self._cursor = self.handle.cursor()
+        self._cursor = self.handle.cursor(experimental_python_types=True)


What are the main functionalities of dbt-trino affected by this change? Is it only dbt seed ?

It would also return Python types instead of strings, so more places than just seeds may be affected, #28 for example.

Also any dbt packages expecting e.g. a Python datetime or a number instead of a string to do some calculations.

Note that most of the other dbapi return proper Python data types and therefore most of the libraries took that as an assumption.
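
A small sketch of why the returned type matters for something like a freshness check. The `age` function and the sample values are hypothetical and are not taken from dbt-core or dbt-trino.

```python
from datetime import datetime, timedelta, timezone


def age(max_loaded_at, now):
    """Hypothetical freshness check: how old is the newest loaded row?"""
    return now - max_loaded_at


now = datetime(2022, 4, 28, 8, 0, tzinfo=timezone.utc)

# A driver returning native types hands back a datetime, so the arithmetic works.
as_datetime = datetime(2022, 4, 28, 7, 0, tzinfo=timezone.utc)
fresh_for = age(as_datetime, now)

# A driver returning strings hands back text, and the same arithmetic fails.
as_string = "2022-04-28 07:00:00.000 UTC"
try:
    age(as_string, now)
    string_failed = False
except TypeError:
    string_failed = True
```

Code written against other DB-API drivers tends to assume the first case, which is the point made above.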

Have you tested if the PR solves #28 ?

Not yet, but will do by next week.

Source freshness has been successfully tested

dbt/adapters/trino/connections.py

mdesmet · 2022-04-28T07:57:22Z

I've also tested the source freshness feature

❯ dbt source freshness
07:56:04  Running with dbt=1.0.3
07:56:04  Unable to do partial parsing because profile has changed
07:56:05  Found 19 models, 14 tests, 0 snapshots, 0 analyses, 375 macros, 0 operations, 5 seed files, 13 sources, 0 exposures, 0 metrics
07:56:05  
07:56:05  Concurrency: 8 threads (target='trino')
07:56:05  
07:56:05  1 of 1 START freshness of replica_dwh_screenings.rd_dag......................... [RUN]
07:56:06  1 of 1 PASS freshness of replica_dwh_screenings.rd_dag.......................... [PASS in 0.85s]
07:56:06  Done.
❯ dbt source freshness
07:56:27  Running with dbt=1.0.3
07:56:27  Found 19 models, 14 tests, 0 snapshots, 0 analyses, 375 macros, 0 operations, 5 seed files, 13 sources, 0 exposures, 0 metrics
07:56:27  
07:56:27  Concurrency: 8 threads (target='trino')
07:56:27  
07:56:27  1 of 1 START freshness of replica_dwh_screenings.rd_dag......................... [RUN]
07:56:28  1 of 1 ERROR STALE freshness of replica_dwh_screenings.rd_dag................... [ERROR STALE in 0.57s]
07:56:28

findinpath · 2022-05-04T15:55:53Z

README.md

+| host                           | The hostname to connect to                                                                                   | Required                                                                                                | `127.0.0.1`                      |
+| port                           | The port to connect to the host on                                                                           | Required                                                                                                | `8080`                           |
+| threads                        | How many threads dbt should use                                                                              | Optional (default is `1`)                                                                               | `8`                              |
+| patch_http_client_header_limit | [Patch python's http client to work around LineTooLong limit](#prepared-statements)                      | Optional (default is `true`)                                                                           | `true` or `false`                |


patch_http_client_header_limit please make sure that the tabs are aligned.

findinpath · 2022-05-04T15:59:29Z

dbt/adapters/trino/connections.py

+                headers.append(line)
+                if line in (b'\r\n', b'\n', b''):
+                    break
+            header_string = b''.join(headers).decode('iso-8859-1')


why Latin-1 encoding?

That's the encoding according to the HTTP spec.

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
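
A practical side effect of that choice, shown as a minimal sketch: ISO-8859-1 maps every byte value to exactly one character, so decoding header bytes with it cannot fail, whereas UTF-8 rejects some byte sequences.

```python
# Every possible byte value, 0x00 through 0xFF
all_bytes = bytes(range(256))

# ISO-8859-1 (Latin-1) decodes each byte to the code point with the same number
decoded = all_bytes.decode("iso-8859-1")
round_trip = decoded.encode("iso-8859-1")

# The same input is not valid UTF-8
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```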

findinpath · 2022-05-04T16:06:42Z

@mdesmet please consider shrinking this PR / spin-off part of it in another PR to enable the source freshness functionality and leave the controversial discussion around the line too long for dbt seed aside.

mdesmet · 2022-05-04T21:49:05Z

@mdesmet please consider shrinking this PR / spin-off part of it in another PR to enable the source freshness functionality and leave the controversial discussion around the line too long for dbt seed aside.

As discussed I've added a new flag, also documented in the README.

findinpath · 2022-05-04T16:00:55Z

dbt/adapters/trino/connections.py

+        if TrinoCredentials._HTTP_CLIENT_PATCHED:
+            return
+        if (self.patch_http_client_header_limit):
+            TrinoCredentials.patch_http_client()


Why do you use TrinoCredentials and not self. ?
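
One possible reading of the class-qualified access, sketched under assumptions: the flag is meant to be shared by all instances so the patch is applied at most once per process. `Credentials`, `maybe_patch`, and `patch_calls` are hypothetical names, and the hunk above does not show where the real flag gets set.

```python
class Credentials:
    """Hypothetical stand-in for TrinoCredentials."""

    _PATCHED = False   # class attribute, shared by every instance
    patch_calls = 0    # counts how often the (pretend) patch ran

    def __init__(self, patch_enabled=True):
        self.patch_enabled = patch_enabled

    def maybe_patch(self):
        if Credentials._PATCHED:
            return
        if self.patch_enabled:
            Credentials.patch_calls += 1
            # Assigning through the class updates the shared flag;
            # self._PATCHED = True would only create an instance attribute.
            Credentials._PATCHED = True


first, second = Credentials(), Credentials()
first.maybe_patch()
second.maybe_patch()   # no-op, the shared flag is already set
```

Reading the flag through `self` would also work; it is the assignment that has to go through the class for the guard to hold across instances.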

findinpath · 2022-05-04T16:03:39Z

dbt/adapters/trino/connections.py

+        def parse_headers(fp, _class=HTTPMessage):
+            headers = []
+            while True:
+                line = fp.readline(_MAXLINE + 1)


Where is this code snippet taken from?
I'm not keen on adding this workaround to the codebase in its current form.

It has been taken from Stack Overflow.

I also had doubts about the encoding and investigated it at the time. I definitely don't like patching standard libraries, since the patch may affect any other code executed afterwards, so I tend to agree with you. However, dbt code is generally executed through a dbt command, so this patch will probably not affect other applications, apart from very exotic usage of dbt as a library.

findinpath · 2022-05-04T16:05:11Z

dbt/adapters/trino/connections.py

@@ -64,7 +62,7 @@ def __init__(self, handle):
        self._fetch_result = None

    def cursor(self):
-        self._cursor = self.handle.cursor()
+        self._cursor = self.handle.cursor(experimental_python_types=True)


I played a bit with a 15K records CSV file

Note that dbt seed is intended mostly for test purposes.
Is this a problem that we should be solving?

findinpath · 2022-05-05T05:05:15Z

dbt/adapters/trino/connections.py

@@ -21,6 +21,8 @@


 logger = AdapterLogger("Trino")
+PATCH_HTTP_CLIENT_DEFAULT = True
+DISABLE_PREPARED_STATEMENTS_DEFAULT = False


What about flipping around the name -> PREPARED_STATEMENTS_ENABLED ?

findinpath · 2022-05-05T05:06:19Z

dbt/adapters/trino/connections.py

    _ALIASES = {"catalog": "database"}
+    _HTTP_CLIENT_PATCHED = False


http client patching is not in any way critical for this PR.

Can we get this change in a different PR to concentrate more thoroughly on it?

Now that we added the flag for the prepared statements, that's true. It may however slightly increase the odds of the defaults failing.

I had discussed this issue earlier with @hovaesco. In the end I think Trino should play nicely with the Python ecosystem, so Trino should abide by the limits of Python (such as http.client._MAXLINE). So ideally we get this fixed in Trino itself.

It's just that compared to Python scripting, the interface of dbt is not Python but the dbt command, so it's not easy for users to add this Python snippet and override the Python environment. So I thought it would be nice to ensure smooth operations of any existing large seed files and/or bumped up batch size that worked before but not anymore with the prepared statements.

I'm not against adding this kind of patch.
I would rather discuss it more thoroughly (in a separate PR) before actually making it generally available.

mdesmet · 2022-05-05T20:05:41Z

I made use of the new test framework to add some integration tests. I also added a trino_connection fixture to be able to directly query data on Trino if necessary. Used here to detect if the dbt seed execution used prepared statements or not.

findinpath · 2022-05-06T12:22:59Z

dbt/adapters/trino/connections.py

-            # trino doesn't actually pass bindings along so we have to do the
-            # escaping and formatting ourselves
+        if not self._prepared_statements_enabled and bindings is not None:
+            # DEPRECATED: by default prepared statements are used.


Please create a follow-up request to remove this safety feature after one or two releases.

findinpath

LGTM % comments

findinpath · 2022-05-06T12:29:41Z

dbt/include/trino/macros/adapters.sql

+  {{ return('?') }}
+  {%- else -%}
+  {{ return('%s') }}
+  {%- endif -%}


{%- if target.prepared_statements_enabled|as_bool -%}
{{ return('?') }}
{%- else -%}
{{ return('%s') }}
{%- endif -%}

findinpath · 2022-05-06T12:30:51Z

tests/conftest.py

@@ -1,4 +1,5 @@
 import pytest
+import trino

 # Import the fuctional fixtures as a plugin


Pls add a Fix typo commit for changing "fuctional" to "functional"

findinpath · 2022-05-06T12:41:10Z

tests/functional/adapter/materialization/test_prepared_statements.py

+
+    # The actual sequence of dbt commands and assertions
+    # pytest will take care of all "setup" + "teardown"
+    def test_run_seed_with_prepared_statements_disabled(self, project, trino_connection):


Both methods:

test_run_seed_with_prepared_statements_disabled

test_run_seed_with_prepared_statements_enabled

look roughly the same. Extract the common logic to the base class.

hovaesco

LGTM, only one small comment.

tests/conftest.py


mdesmet mentioned this pull request Apr 15, 2022

Source Freshness is not implemented #28

Closed

findinpath reviewed Apr 16, 2022

hovaesco reviewed Apr 16, 2022

findinpath reviewed Apr 19, 2022

findinpath reviewed May 4, 2022

findinpath reviewed May 5, 2022

Align pytz with trino-python-client

3e68aec

mdesmet requested review from findinpath and hovaesco May 5, 2022 20:02

findinpath reviewed May 6, 2022

findinpath approved these changes May 6, 2022

hovaesco approved these changes May 8, 2022

findinpath self-requested a review May 9, 2022 10:25

findinpath approved these changes May 9, 2022

Use experimental_python_types from trino-python-client for prepared statements

9a48ac9

hovaesco merged commit 70e4449 into starburstdata:master May 9, 2022

mdesmet deleted the feature/types branch August 8, 2022 15:15

EminUZUN pushed a commit to EminUZUN/dbt-trino that referenced this pull request Feb 14, 2023

Merge pull request starburstdata#61 from dbt-labs/snyk-fix-068d0587e0…

66efba9

…659d827fb85fb65601b36a [Snyk] Security upgrade openjdk from 8-jre to 16-ea-17

EminUZUN pushed a commit to EminUZUN/dbt-trino that referenced this pull request Feb 14, 2023

Merge pull request starburstdata#61 from mdesmet/feature/types

5b015d1

Use native prepared statements from trino-python-client

damian3031 pushed a commit to damian3031/dbt-trino that referenced this pull request Sep 9, 2024

Merge pull request starburstdata#61 from mdesmet/feature/types

fac822f

Use native prepared statements from trino-python-client
