
to_csv() surrogates not allowed #22610

Closed
obilodeau opened this issue Sep 5, 2018 · 8 comments · Fixed by #32702
Labels: Bug · IO CSV (read_csv, to_csv) · Unicode (Unicode strings)

Comments

@obilodeau
Contributor

Code Sample

import pandas as pd

s = '\ud800'
srs = pd.Series()
srs.loc[ 0 ] = s
srs.to_csv('testcase.csv')

Stack trace:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-50-769583baba38> in <module>()
      4 srs = pd.Series()
      5 srs.loc[ 0 ] = s
----> 6 srs.to_csv('testcase.csv')

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in to_csv(self, path, index, sep, na_rep, float_format, header, index_label, mode, encoding, compression, date_format, decimal)
   3779                            index_label=index_label, mode=mode,
   3780                            encoding=encoding, compression=compression,
-> 3781                            date_format=date_format, decimal=decimal)
   3782         if path is None:
   3783             return result

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1743                                  doublequote=doublequote,
   1744                                  escapechar=escapechar, decimal=decimal)
-> 1745         formatter.save()
   1746 
   1747         if path_or_buf is None:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
    169                 self.writer = UnicodeWriter(f, **writer_kwargs)
    170 
--> 171             self._save()
    172 
    173         finally:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save(self)
    284                 break
    285 
--> 286             self._save_chunk(start_i, end_i)
    287 
    288     def _save_chunk(self, start_i, end_i):

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
    311 
    312         libwriters.write_csv_rows(self.data, ix, self.nlevels,
--> 313                                   self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 2: surrogates not allowed

Problem description

The presence of Unicode surrogates in a DataFrame (or Series) causes an error in .to_csv(). The same problem has already been fixed in .to_hdf() by adding an errors= argument, which lets you pass the surrogatepass or surrogateescape error handler.

See the original bug report and the PR that fixed it.
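
For reference, a minimal sketch of the analogous .to_hdf() call (assuming a pandas build where to_hdf() exposes errors= and PyTables is installed; the file name and key are illustrative):

import pandas as pd

# Sketch only: to_hdf() accepts errors=, so the surrogate is passed through
# instead of raising UnicodeEncodeError when encoding to UTF-8.
srs = pd.Series(['\ud800'])
srs.to_hdf('testcase.h5', key='data', errors='surrogatepass')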

Expected Output

No error.

Output of pd.show_versions()

I forgot to grab this before the end of my workshop and I destroyed the cloud instance. Sorry. It was Python 3.6 and pandas 0.23.4 I think.

@obilodeau
Contributor Author

I forgot to say that the workaround is to make sure there are no Unicode surrogates in your data. In my case, this meant decoding and re-encoding a field that came from another library.

For example:

field = field.encode('utf-8', errors='surrogatepass')
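
Note that .encode() returns bytes rather than str; a variant that keeps the field as valid text by using a lossy handler instead (a sketch, not the exact code I used) would be:

# Sketch only: round-trip through bytes with a lossy handler so the result
# stays a str that the UTF-8 codec can encode. 'replace' swaps the lone
# surrogate for '?'.
field = 'ab\ud800cd'
field = field.encode('utf-8', errors='replace').decode('utf-8')
print(field)  # 'ab?cd'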

@WillAyd added the IO CSV (read_csv, to_csv) label Sep 5, 2018
@WillAyd
Member

WillAyd commented Sep 5, 2018

Is this possible to do with the stdlib csv writer?

@obilodeau
Contributor Author

This (with a plain open()):

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w") as _file:
    writer = csv.writer(_file)
    writer.writerow(row)

will yield the error below:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-12-c276abf97bef> in <module>()
      3 with open("test-you-can-delete.csv", "w") as _file:
      4    writer = csv.writer(_file)
----> 5    writer.writerow(row)

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

But open() supports passing a codec error handler via the errors= keyword argument:

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w", errors='surrogatepass') as _file:
    writer = csv.writer(_file)
    writer.writerow(row)

This doesn't generate an error.

Implementing an errors= keyword argument in to_csv() satisfies the principle of least surprise; to me, this is the way to go (see the sketch below). Having to explain why every field should be re-encoded with encode() before calling to_csv(), when everything else worked without it (and to_csv() itself used to), was a painful moment for young data scientists.
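
A sketch of how such a keyword might read (hypothetical at this point; the idea is simply to forward errors= to the underlying open(), the same way encoding= is forwarded today):

import pandas as pd

# Hypothetical usage of the proposed keyword -- not implemented at the time
# of this comment; errors= would be passed straight to the file handle.
srs = pd.Series(['\ud800'])
srs.to_csv('testcase.csv', errors='surrogatepass')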

@WillAyd
Member

WillAyd commented Sep 5, 2018

Makes sense - would accept a PR if you are up for it

@WillAyd added this to the Contributions Welcome milestone Sep 5, 2018
@obilodeau
Contributor Author

It's a busy time of year, but I might get back to it later.

In the meantime, if anyone else is interested in the problem, it is documented here with a workaround and a link to a similar fix in to_hdf() for inspiration.

@hartwork

I forgot to say that the workaround is to make sure there are no Unicode surrogates in your data. In my case, this meant decoding and re-encoding a field that came from another library.

For example:

field = field.encode('utf-8', errors='surrogatepass')

I would like to point out that this approach can produce malformed UTF-8, so I'm not sure it's a good path forward. For proof:

In [2]: list(sys.version_info)                                                                             
Out[2]: [3, 6, 10, 'final', 0]

In [3]: '\ud800'.encode('utf-8', errors='surrogatepass').decode('utf-8')                                   
[..]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

@obilodeau
Contributor Author

Of course, you should use the error handler that fits your needs given the context: https://docs.python.org/3/library/codecs.html#error-handlers

I could have used surrogateescape or replace as well, because in my context the invalid content was garbage and could be discarded. A few instances of it in large dataframes were enough to prevent a CSV dump, which is what made this annoying and unintuitive to inexperienced Python and pandas programmers.

If I still had the original data, I would update my workaround above, but I can't confirm 100% that surrogateescape works, unfortunately.
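
For illustration, here is what a few of the documented handlers do to a string containing a lone surrogate (surrogateescape is left out since, as noted, I can't confirm it covers this case):

s = 'ab\ud800cd'  # lone surrogate in the middle

print(s.encode('utf-8', errors='replace'))           # b'ab?cd'
print(s.encode('utf-8', errors='ignore'))            # b'abcd'
print(s.encode('utf-8', errors='backslashreplace'))  # b'ab\\ud800cd'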

@roberthdevries
Contributor

take
