Unable to insert UTF-8 character with fast_executemany option #617

maimat1 · 2019-09-02T17:54:35Z

I am trying to copy data from one SQL Server instance to another using pyODBC package and I encountered an error while handling UTF-8 characters on Windows.

I have narrowed it down to this snippet, which fails for me:

import pyodbc
import sys
print('system:' + str(sys.getwindowsversion()))
print('python:' + sys.version)
print('pyodbc:' + pyodbc.version)
cnxn = pyodbc.connect(
    r'''Driver=ODBC Driver 17 for SQL Server;Server=localhost;'''
    r'''Database=DB_Pros;port=1433;Trusted_Connection=yes'''
    )
crsr = cnxn.cursor()
crsr.fast_executemany=True

crsr.execute("""DROP TABLE IF EXISTS dbo.testing_table""")
crsr.execute("""CREATE TABLE dbo.testing_table (col1 nvarchar(max) null)""")
crsr.executemany("""INSERT INTO dbo.testing_table (col1) VALUES (?)""", [[' 🎥 ',]])

return_value = crsr.execute("SELECT * FROM dbo.testing_table").fetchone()

print(return_value )

crsr.close()
cnxn.close()

I get the following error:

system:sys.getwindowsversion(major=10, minor=0, build=18362, platform=2, service_pack='')
python:3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
pyodbc:4.0.27
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    crsr.executemany("""INSERT INTO dbo.testing_table (col1) VALUES (?)""", [[' 🎥 ',]])
pyodbc.DataError: ('22026', '[22026] [Microsoft][ODBC Driver 17 for SQL Server]String data, length mismatch (0) (SQLParamData)')

Everything works fine when I set the crsr.fast_executemany=False.

This seems to be somewhat related to #246 , but this happens on Windows system.

The text was updated successfully, but these errors were encountered:

v-chojas · 2019-09-04T17:13:55Z

Could you post an ODBC trace?

maimat1 · 2019-09-05T11:17:34Z

Hi, thank you for looking into this.

Below attached is the ODBC trace, I have executed the snippet posted in the initial comment:
SQL.LOG

v-makouz · 2019-09-06T22:59:07Z

It looks like Python3 Unicode conversions don't really handle characters that require two UTF-16 codepoints very well, so "space", "🎥", "space" gets mangled.

I need to dig a lot deeper to see if there is a neat fix for this, but one work around is to manually escape the codepoints with a "\u" in front of each one of the pair, so in case of "🎥", which is "d83c dfa5" in UTF-16, you'd put "\ud83c\udfa5"

gordthompson · 2019-09-08T14:21:34Z

For me, with Python 3.7.4 under Windows 7, " \ud83c\udfa5 " doesn't seem to produce a valid string:

str2 = " \ud83c\udfa5 "
print(str2)
# UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1-2: surrogates not allowed

However, " \U0001F3A5 " works fine.

str2 = " \U0001f3a5 "
print(str2)
#  🎥

Unfortunately, it isn't really a workaround because the string parameter value is the same whether it is defined using a string literal (" 🎥 ") or an escaped Unicode code point (as above).

I also notice that the example code fails if the column is declared as nvarchar(max) (which triggers DAE) but does not fail if the column is declared as nvarchar(50).

maimat1 · 2019-09-10T06:22:46Z

I also notice that the example code fails if the column is declared as nvarchar(max) (which triggers DAE) but does not fail if the column is declared as nvarchar(50).

Yes, I forgot to mention this in the original description - this is only an issue if you insert to nvarchar(max) column with fast_executemany flag using cursor.executemany.

However, the string " 🎥 " was originally only a part of a much longer text stored in nvarchar(max) column from which I isolated the problematic part in the snippet, so changing the datatype of the column is unfortunately not a solution.

v-makouz · 2019-09-10T15:49:39Z

I know what the problem is now, it's due to some confusion between number of characters and number of codepoints in conversion code. I checked in a fix that should work for this case:

v-makouz@606b4a9

However in the process of testing it, I discovered that when using Python2 on Windows, inserting any Unicode character in this manner doesn't work, because it doesn't do any conversion and so gives the driver the data in the source format (i.e. UTF-8), which the driver doesn't interpret correctly, as Windows itself is not set to UTF-8, so the driver treats it as 1252

So I will see what to do about that and will make a PR once I have a fix for that case as well.

gordthompson · 2019-09-10T16:24:21Z

I discovered that when using Python2 on Windows, inserting any Unicode character in this manner doesn't work

If @mkleehammer feels so inclined perhaps we could take this opportunity to discuss the future of Python2 support.

maimat1 · 2019-09-15T14:26:59Z

Thank you for your support, I have manually built pyodbc with proposed fix (v-makouz@606b4a9) and it seems to work flawlessly since then (on Windows 10, Python 3.7.1).

mkleehammer · 2019-10-10T20:41:04Z

@gordthompson What are your thoughts on Python 2 support? I'd love to stop updates for it. I personally maintain a large code base still using Python 2 for some of the microservices, but because they are critical 24x7 services I rarely want to update the the pyodbc versions. (They'll hopefully be ported in 12 months or so.). I'd be happy to make occasional security / crash fixes on a dedicated Python 2 branch for a couple of years.

If we take that step. we should also look at what the minimum Python 3 version should be. I guess that would depend on which LTS OS versions are out there.

gordthompson · 2019-10-11T17:35:31Z

I'd be happy to make occasional security / crash fixes on a dedicated Python 2 branch for a couple of years.

In that case I would propose the following:

At the end of 2019, create a pyodbc4 branch. The master branch will then be for v5 (and beyond) and will be for Python_3(+) only. Probably no need to do a massive cleanup to remove all Python_2 code from master; as other changes are made to a particular source file just remove the Python_2-specific stuff at the same time. Eventually the Python_2 remnants will disappear.

If we take that step. we should also look at what the minimum Python 3 version should be. I guess that would depend on which LTS OS versions are out there.

My personal preference would be for 3.6+, but I notice that Ubuntu 16.04 LTS still distributes 3.5. Ubuntu 16.04 is on Standard Support until April 2021 (and EOL three years later) so it will be around for a while. So, we might have to go with Python_3.5+, at least at the source level (e.g., no f'{string}'s 😢) for the time being. For Windows we could probably get away with only building 3.6+ wheels. (I don't know about Macs.)

timnyborg · 2019-11-06T10:06:47Z

I know what the problem is now, it's due to some confusion between number of characters and number of codepoints in conversion code. I checked in a fix that should work for this case:

v-makouz@606b4a9

However in the process of testing it, I discovered that when using Python2 on Windows, inserting any Unicode character in this manner doesn't work, because it doesn't do any conversion and so gives the driver the data in the source format (i.e. UTF-8), which the driver doesn't interpret correctly, as Windows itself is not set to UTF-8, so the driver treats it as 1252

So I will see what to do about that and will make a PR once I have a fix for that case as well.

Does this fix enable high unicode chars for inserts that don't use parameterized queries?
E.g.
cnxn.execute("""INSERT INTO dbo.testing_table (col1) VALUES (N'🎥')""")

instead of
cnxn.execute("""INSERT INTO dbo.testing_table (col1) VALUES (?)""", [[' 🎥 ',]])

gordthompson · 2019-11-06T15:03:04Z

@timnyborg - Inserts using N'...' literals already work with supplementary characters. Do you have an example of a case where they don't?

timnyborg · 2019-11-06T16:07:59Z

Well, while the parameterized method works on my stack (Python 3.7, MSSQL, Pyodbc 4.0.27, FreeTDS 1.00.40), the literals cause a sql syntax error. Perhaps it's a separate issue from the interaction with FreeTDS?

cnxn = pyodbc.connect(...)
res = cnxn.execute("update person set surname = ? where id = 1", '🎥') # Works
res = cnxn.execute("update person set surname = N'🎥' where id = 1")

pyodbc.ProgrammingError: ('42000', "[42000] [FreeTDS][SQL Server]Unclosed quotation mark 
after the character string ''. (105) (SQLExecDirectW)")

v-chojas · 2019-11-06T16:10:24Z

Try the Microsoft ODBC Driver for SQL Server instead: https://docs.microsoft.com/en-us/sql/connect/odbc/microsoft-odbc-driver-for-sql-server

timnyborg · 2019-11-06T17:04:18Z

Yep, works with the Microsoft drivers (didn't even know they were available!)

However, needed to implement an OpenSSL fix on Debian 10 to get around error 0x2746:
microsoft/msphpsql#1021 (comment)

Thanks, all.

ambarbour · 2019-11-25T19:48:53Z

Is the fix for this problem (parameters not literals) currently scheduled for any release?

In the interim is there a workaround, a way of 'pre-treating' the strings to bypass the error? The closest I can get is val.encode('unicode-escape').decode('utf-8') anything going to nvarchar(max), but then they look pretty bad because it encodes even unicode chars below 256, such as common western language accented vowels.

I load/reload a lot of data daily, many millions of rows, so I need fast_executemany. I just swapped out turbodbc, because it has its own problems with unicode strings longer than 1600 chars, and now I have hit this issue.

Any advice is appreciated, thanks.

ambarbour · 2019-11-25T21:03:29Z

Well, it's not pretty but it works:

re_multipoint_unicode = re.compile(f'[{chr(0x100)}-{chr(0x10FFFF)}]+')
bad_str = chr(0xFF) + chr(0x1F609)
good_str = re_multipoint_unicode.sub(lambda v: codecs.decode(codecs.unicode_escape_encode(v.group())[0]), bad_str)

If there is a better way, and there probably is, please let me know...

Thanks

* Merging updates. (#1) Merging updates. * fix for smalldatetime issue * Fixed a bad merge * Fix for inserting high unicode chars * merge with main branch * Fix for function sequence error * reverted unnecessary file changes * removed obsolete include * fix for 540 * fix for TVP type mismatch issue * Combined the IFs * Fix for high unicode insertion, WIP * Fix python2 high unicode insertion * Renamed a table to t1 Co-authored-by: v-chojas <25211973+v-chojas@users.noreply.github.com> Co-authored-by: Michael Kleehammer <michael@kleehammer.com>

gordthompson mentioned this issue Oct 15, 2019

Continue with Windows (and Mac?) wheels for Python 3.4? #635

Closed

ethan-deng mentioned this issue Nov 29, 2019

Cursor throws an error when trying to insert in SQL Server database when fast_executemany flag is set #371

Closed

maimat1 closed this as completed Feb 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to insert UTF-8 character with fast_executemany option #617

Unable to insert UTF-8 character with fast_executemany option #617

maimat1 commented Sep 2, 2019

v-chojas commented Sep 4, 2019

maimat1 commented Sep 5, 2019

v-makouz commented Sep 6, 2019

gordthompson commented Sep 8, 2019 •

edited

Loading

maimat1 commented Sep 10, 2019

v-makouz commented Sep 10, 2019

gordthompson commented Sep 10, 2019

maimat1 commented Sep 15, 2019

mkleehammer commented Oct 10, 2019

gordthompson commented Oct 11, 2019 •

edited

Loading

timnyborg commented Nov 6, 2019

gordthompson commented Nov 6, 2019

timnyborg commented Nov 6, 2019

v-chojas commented Nov 6, 2019

timnyborg commented Nov 6, 2019

ambarbour commented Nov 25, 2019

ambarbour commented Nov 25, 2019

Unable to insert UTF-8 character with fast_executemany option #617

Unable to insert UTF-8 character with fast_executemany option #617

Comments

maimat1 commented Sep 2, 2019

v-chojas commented Sep 4, 2019

maimat1 commented Sep 5, 2019

v-makouz commented Sep 6, 2019

gordthompson commented Sep 8, 2019 • edited Loading

maimat1 commented Sep 10, 2019

v-makouz commented Sep 10, 2019

gordthompson commented Sep 10, 2019

maimat1 commented Sep 15, 2019

mkleehammer commented Oct 10, 2019

gordthompson commented Oct 11, 2019 • edited Loading

timnyborg commented Nov 6, 2019

gordthompson commented Nov 6, 2019

timnyborg commented Nov 6, 2019

v-chojas commented Nov 6, 2019

timnyborg commented Nov 6, 2019

ambarbour commented Nov 25, 2019

ambarbour commented Nov 25, 2019

gordthompson commented Sep 8, 2019 •

edited

Loading

gordthompson commented Oct 11, 2019 •

edited

Loading