to_csv() surrogates not allowed #22610

obilodeau · 2018-09-05T15:56:17Z

Code Sample

import pandas as pd

s = '\ud800'
srs = pd.Series()
srs.loc[ 0 ] = s
srs.to_csv('testcase.csv')

Stack trace:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-50-769583baba38> in <module>()
      4 srs = pd.Series()
      5 srs.loc[ 0 ] = s
----> 6 srs.to_csv('testcase.csv')

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in to_csv(self, path, index, sep, na_rep, float_format, header, index_label, mode, encoding, compression, date_format, decimal)
   3779                            index_label=index_label, mode=mode,
   3780                            encoding=encoding, compression=compression,
-> 3781                            date_format=date_format, decimal=decimal)
   3782         if path is None:
   3783             return result

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1743                                  doublequote=doublequote,
   1744                                  escapechar=escapechar, decimal=decimal)
-> 1745         formatter.save()
   1746 
   1747         if path_or_buf is None:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
    169                 self.writer = UnicodeWriter(f, **writer_kwargs)
    170 
--> 171             self._save()
    172 
    173         finally:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save(self)
    284                 break
    285 
--> 286             self._save_chunk(start_i, end_i)
    287 
    288     def _save_chunk(self, start_i, end_i):

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
    311 
    312         libwriters.write_csv_rows(self.data, ix, self.nlevels,
--> 313                                   self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 2: surrogates not allowed

Problem description

A Unicode surrogate anywhere in a DataFrame (or Series) makes .to_csv() raise UnicodeEncodeError. The same problem was already fixed in .to_hdf() by adding an errors= argument, which lets the caller pick an error handler such as surrogatepass or surrogateescape.

See the original bug report and the PR that fixed it.

Expected Output

No error.

Output of `pd.show_versions()`

I forgot to grab this before the end of my workshop and I destroyed the cloud instance. Sorry. It was Python 3.6 and pandas 0.23.4 I think.

obilodeau · 2018-09-05T16:02:30Z

Forgot to say that the workaround is to make sure you have no UTF-8 surrogates in your data. In my case this meant that I needed to decode / reencode a field that came from another library.

For example:

field = field.encode('utf-8', errors='surrogatepass')
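Note that encode() returns bytes, so the one-liner above leaves the field as a bytes object rather than a str. Below is a minimal sketch of a variant that hands back a clean str instead. The helper name scrub_surrogates is hypothetical, and it assumes the lone surrogates are garbage that can be replaced; it could be applied to each cell of a column before calling to_csv().

```python
def scrub_surrogates(text, errors="replace"):
    """Return text with lone surrogates handled by the given encode error handler.

    Hypothetical helper, not part of pandas. Use a handler that yields valid
    UTF-8 ("replace", "ignore", "backslashreplace"); "surrogatepass" would make
    the strict decode below fail.
    """
    # Encode leniently, then decode strictly so the result is a valid str.
    return text.encode("utf-8", errors=errors).decode("utf-8")


fields = ["ok", "bad\ud800value"]
clean = [scrub_surrogates(f) for f in fields]
print(clean)  # ['ok', 'bad?value']
```

With errors="ignore" the surrogate is dropped instead of being replaced by a question mark.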

WillAyd · 2018-09-05T16:49:16Z

Is this possible to do with the stdlib csv writer?

obilodeau · 2018-09-05T20:14:13Z

This (plain open):

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w") as _file:
   writer = csv.writer(_file)
   writer.writerow(row)

will yield the error below:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-12-c276abf97bef> in <module>()
      3 with open("test-you-can-delete.csv", "w") as _file:
      4    writer = csv.writer(_file)
----> 5    writer.writerow(row)

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

But open() supports passing a codec error handler with the errors= named argument:

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w", errors='surrogatepass') as _file:
   writer = csv.writer(_file)
   writer.writerow(row)

This doesn't generate an error.
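To see what actually lands in the file, here is a small sketch that writes the same row through csv.writer into an in-memory UTF-8 file with a few different error handlers. The write_row helper is only for illustration; io.TextIOWrapper takes the same errors= argument as open().

```python
import csv
import io


def write_row(value, errors):
    """Write one CSV row to an in-memory UTF-8 file and return the raw bytes."""
    raw = io.BytesIO()
    # newline="" is what the csv docs recommend for files handed to csv.writer.
    text = io.TextIOWrapper(raw, encoding="utf-8", errors=errors, newline="")
    csv.writer(text).writerow([value])
    text.flush()
    return raw.getvalue()


for handler in ("surrogatepass", "replace", "backslashreplace"):
    print(handler, write_row("\ud800", handler))
# surrogatepass b'\xed\xa0\x80\r\n'
# replace b'?\r\n'
# backslashreplace b'\\ud800\r\n'
```

Only surrogatepass keeps the original code point, and the bytes it writes are not valid UTF-8 for a strict reader. The other two handlers give up the original value but produce a well-formed file.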

Implementing the named argument errors= in to_csv() satisfies the principle of least surprise. To me, this is the way to go. Having to explain why all fields should be re-encoded with encode() before using to_csv() while everything else worked without it (and used to work without it before) was a painful moment for young data scientists.

WillAyd · 2018-09-05T20:28:34Z

Makes sense - would accept a PR if you are up for it

obilodeau · 2018-09-05T20:38:45Z

It's a busy time of year but I might get back to it later.

In the meantime, if anyone else is interested in the problem, it is documented with a workaround and a link to a similar fix in to_hdf() for inspiration.

hartwork · 2020-02-20T13:08:30Z

> Forgot to say that the workaround is to make sure you have no UTF-8 surrogates in your data. In my case this meant that I needed to decode / reencode a field that came from another library.
>
> For example:
> field = field.encode('utf-8', errors='surrogatepass')

I would like to point out that this approach can produce malformed UTF-8 so I'm not sure if that's a good path forward. For proof:

In [2]: list(sys.version_info)
Out[2]: [3, 6, 10, 'final', 0]

In [3]: '\ud800'.encode('utf-8', errors='surrogatepass').decode('utf-8')
[..]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
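For completeness, a short sketch of the round trip: the bytes produced by surrogatepass are rejected by a strict UTF-8 decoder, as shown above, but they can be read back if the reader also opts in to surrogatepass. This assumes both the writer and the reader are under your control.

```python
raw = "\ud800".encode("utf-8", errors="surrogatepass")
print(raw)  # b'\xed\xa0\x80'

# A strict decoder refuses the bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# The same handler on the decoding side restores the original code point.
restored = raw.decode("utf-8", errors="surrogatepass")
print(restored == "\ud800")  # True
```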

obilodeau · 2020-02-20T16:29:14Z

Of course, you should use the error handler that fits your need based on context: https://docs.python.org/3/library/codecs.html#error-handlers

I could have used surrogateescape or replace too, because in my context the invalid UTF-8 content was garbage and could be discarded. A few instances of it in large dataframes were preventing a CSV dump, which was annoying and unintuitive for inexperienced Python and pandas programmers.

If I still had the original data, I would update my workaround above but I can't 100% confirm that using surrogateescape works, unfortunately.
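On the surrogateescape question, here is a quick check that does not need the original data. On encoding, that handler only maps the code points U+DC80 to U+DCFF back to single bytes, so it is not expected to help with a high surrogate like '\ud800'. The can_encode helper is just for illustration.

```python
def can_encode(text, errors):
    """Return True if text encodes to UTF-8 with the given error handler."""
    try:
        text.encode("utf-8", errors=errors)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("\ud800", "surrogateescape"))  # False
print(can_encode("\udcff", "surrogateescape"))  # True
print(can_encode("\ud800", "surrogatepass"))    # True
print(can_encode("\ud800", "replace"))          # True
```

So for the '\ud800' case in this report, surrogatepass or replace would get past the error, while surrogateescape would still raise.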

roberthdevries · 2020-03-14T15:58:03Z

take

WillAyd added the IO CSV read_csv, to_csv label Sep 5, 2018

WillAyd added this to the Contributions Welcome milestone Sep 5, 2018

chris-b1 mentioned this issue Sep 18, 2018

BUG: read_table crashes Python on surrogates #22748

Closed

github-actions bot assigned roberthdevries Mar 14, 2020

roberthdevries mentioned this issue Mar 14, 2020

BUG: Add errors argument to to_csv() call to enable error handling for encoders #32702

Merged

5 tasks

jreback added the Unicode Unicode strings label Mar 16, 2020

jreback modified the milestones: Contributions Welcome, 1.1 Mar 19, 2020

roberthdevries mentioned this issue Mar 24, 2020

Suppress UnicodeEncodeError when executing to_csv method #27750

Closed

mroeschke added the Bug label Apr 14, 2020

WillAyd closed this as completed in #32702 Jun 8, 2020
