Suppress UnicodeEncodeError when executing to_csv method #27750

Closed
shigemk2 opened this issue Aug 5, 2019 · 7 comments
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@shigemk2

shigemk2 commented Aug 5, 2019

Code Sample, a copy-pastable example if possible

# error pattern
import pandas as pd

unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
df.to_csv("./test.csv", encoding="cp932") # UnicodeEncodeError: 'cp932' codec can't encode character '\u070a' in position 6: illegal multibyte sequence
# good pattern
import pandas as pd

unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
with open("./test.csv", mode="w", encoding="cp932", errors="ignore") as f:
    df.to_csv(f)

Problem description

UnicodeEncodeError occurs when executing to_csv with the encoding parameter set to SHIFT-JIS or cp932.
We can avoid this error by using with open (the good pattern above), but that code is redundant.
So I would like to be able to suppress UnicodeEncodeError with a parameter of to_csv.
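
For context, here is a minimal sketch of what the standard codec error handlers do with the character from the report. It uses only the built-in str.encode, not pandas, and the variable names are just for illustration.

```python
# Compare the standard codec error handlers on the character from the report.
text = "key,\u070a"

results = {}
for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    # Each handler decides what happens to the unencodable U+070A.
    results[handler] = text.encode("cp932", errors=handler)

try:
    text.encode("cp932")  # errors="strict" is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

for handler, encoded in results.items():
    print(handler, encoded)
print("strict raised UnicodeEncodeError:", strict_failed)
```

Note that "ignore" silently drops the character, so "replace" or "backslashreplace" may be preferable when the data loss should stay visible in the output file.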

Expected Output

# good pattern
import pandas as pd

unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
df.to_csv("./test.csv", encoding="cp932", ignore_error=True)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.6.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : ja_JP.UTF-8
LOCALE : ja_JP.UTF-8

pandas : 0.25.0
numpy : 1.16.2
pytz : 2018.9
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.6.1
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.1
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@bsolomon1124

bsolomon1124 commented Aug 7, 2019

So, it seems like what is being asked here is whether Pandas could add an errors parameter to both to_csv() and read_csv() that is analogous to that of open(). That would entail adding the parameter to CSVFormatter as well. From there, I think it may need to be passed to _get_handle(), and then in turn to open().

I.e.

>>> from pandas.io.common import _get_handle
>>> # This should also take an 'errors' arg 
>>> f, handles = _get_handle("test_cp932.csv", "w", encoding="cp932")
>>> f
<_io.TextIOWrapper name='test_cp932.csv' mode='w' encoding='cp932'>
>>> f.errors
'strict'

That would presumably get passed to open() here.
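
A rough sketch of the pass-through idea, using only the standard library. The helper open_text_handle below is hypothetical and is not pandas' _get_handle; it only illustrates forwarding an errors argument to open().

```python
import os
import tempfile


def open_text_handle(path, mode="w", encoding=None, errors="strict"):
    """Hypothetical stand-in for a handle helper that forwards `errors`."""
    # newline="" leaves line endings untouched, as is usual for CSV writing.
    return open(path, mode, encoding=encoding, errors=errors, newline="")


path = os.path.join(tempfile.mkdtemp(), "test_cp932.csv")

with open_text_handle(path, "w", encoding="cp932", errors="replace") as f:
    handle_errors = f.errors      # the handler now visible on the wrapper
    f.write("key,\u070a\n")       # U+070A cannot be encoded in cp932

with open(path, "rb") as f:
    written = f.read()

print(handle_errors, written)
```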

(All of this is pd.__version__ == '0.25.0'.)

@TomAugspurger
Contributor

Adding an errors parameter, passed through to the file open call, seems reasonable. Alternatively, we could document passing a file handle for this use case.

@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Aug 7, 2019
@TomAugspurger TomAugspurger added IO Data IO issues that don't fit into a more specific label IO CSV read_csv, to_csv labels Aug 7, 2019
shigemk2 added a commit to shigemk2/pandas that referenced this issue Aug 15, 2019
errors : str, default 'strict'
Behavior when the input string can’t be converted according to
the encoding’s rules (strict, ignore, replace, etc.)
See: https://docs.python.org/3/library/codecs.html#codec-base-classes
shigemk2 added a commit to shigemk2/pandas that referenced this issue Aug 15, 2019
…v#27750)

encoding_errors : str, default 'strict'
Behavior when the input string can’t be converted according to
the encoding’s rules (strict, ignore, replace, etc.)
See: https://docs.python.org/3/library/codecs.html#codec-base-classes
shigemk2 added a commit to shigemk2/pandas that referenced this issue Aug 19, 2019
…v#27750)

encoding_errors : str, default 'strict'
Behavior when the input string can’t be converted according to
the encoding’s rules (strict, ignore, replace, etc.)
See: https://docs.python.org/3/library/codecs.html#codec-base-classes
shigemk2 added a commit to shigemk2/pandas that referenced this issue Aug 20, 2019
…v#27750)

encoding_errors : str, default 'strict'
Behavior when the input string can’t be converted according to
the encoding’s rules (strict, ignore, replace, etc.)
See: https://docs.python.org/3/library/codecs.html#codec-base-classes
shigemk2 added a commit to shigemk2/pandas that referenced this issue Sep 2, 2019
…v#27750)

encoding_errors : str, default 'strict'
Behavior when the input string can’t be converted according to
the encoding’s rules (strict, ignore, replace, etc.)
See: https://docs.python.org/3/library/codecs.html#codec-base-classes
@roberthdevries
Contributor

This looks like a duplicate of #22610

@mroeschke mroeschke added Enhancement and removed IO Data IO issues that don't fit into a more specific label labels May 2, 2020
@tgmof

tgmof commented Jun 10, 2020

Good news: #22610 was fixed by #32702, which was recently merged.
Thus I think this issue can be closed @shigemk2 @roberthdevries @mroeschke

@tgmof

tgmof commented Jun 10, 2020

Though as per @bsolomon1124 's remark, it would be meaningful to add this errors argument in read_csv as well

@lithomas1 lithomas1 added the Closing Candidate May be closeable, needs more eyeballs label Mar 10, 2021
@linehammer

On Windows, many editors assume the default ANSI encoding (CP1252 on US Windows) instead of UTF-8 if there is no byte order mark (BOM) at the start of the file. Files store bytes, so all Unicode text has to be encoded into bytes before it can be stored in a file. Both read_csv and to_csv take an encoding option to deal with files in different encodings, so you have to specify one, such as utf-8.

df.to_csv(r'D:\panda.csv', sep='\t', encoding='utf-8')  # raw string so the backslash in the path is not treated as an escape

If you don't specify an encoding, then the encoding used by df.to_csv defaults to ascii in Python 2, or utf-8 in Python 3.

Alternatively, you can escape a problematic Series first with the unicode-escape codec and decode the result back to a string, so that it contains only ASCII characters.

df['column-name'] = df['column-name'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))

This will also rectify the problem.
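
To make the trade-off concrete, here is a small sketch comparing the unicode-escape trick with the "backslashreplace" error handler. The sample string is made up for illustration. The escape trick rewrites every non-ASCII character, including ones cp932 could encode, whereas the error handler touches only the characters that cannot be encoded.

```python
# Hypothetical sample: two characters cp932 can encode, one it cannot.
text = "日本\u070a"

# The trick from the comment above: every non-ASCII character is escaped.
escaped = text.encode("unicode-escape").decode("utf-8")

# Error-handler alternative: only the unencodable character is escaped.
encoded = text.encode("cp932", errors="backslashreplace")
round_trip = encoded.decode("cp932")

print(escaped)
print(round_trip)
```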

@mroeschke mroeschke removed the Closing Candidate May be closeable, needs more eyeballs label Jul 10, 2021
@twoertwein
Member

Though as per @bsolomon1124 's remark, it would be meaningful to add this errors argument in read_csv as well

Closing as to_csv (errors) and read_csv (encoding_errors) both have arguments to ignore encoding errors.
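
For anyone landing here later, a minimal sketch of the two arguments mentioned above. It assumes a pandas version recent enough to have both (1.3 or later, as far as I know); the file names and sample bytes are made up for illustration.

```python
import os
import tempfile

import pandas as pd

tmp_dir = tempfile.mkdtemp()

# Writing: U+070A cannot be encoded in cp932, so it is replaced with "?".
df = pd.DataFrame([["key", "\u070a"]])
out_path = os.path.join(tmp_dir, "test.csv")
df.to_csv(out_path, encoding="cp932", errors="replace", index=False, header=False)

with open(out_path, "rb") as f:
    raw = f.read()

# Reading: 0xFF is not valid UTF-8, so it is decoded as U+FFFD.
in_path = os.path.join(tmp_dir, "broken_utf8.csv")
with open(in_path, "wb") as f:
    f.write(b"key,\xff\n")

df2 = pd.read_csv(in_path, encoding="utf-8", encoding_errors="replace", header=None)

print(raw)
print(df2)
```

Both arguments accept the usual codec error handler names, such as "strict", "ignore", and "replace".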

9 participants