Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conform Series.to_csv to DataFrame.to_csv #19715

Closed
dahlbaek opened this issue Feb 15, 2018 · 8 comments · Fixed by #21896
Closed

Conform Series.to_csv to DataFrame.to_csv #19715

dahlbaek opened this issue Feb 15, 2018 · 8 comments · Fixed by #21896
Labels
Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv

Comments

@dahlbaek
Copy link
Contributor

Thank you for this awesome package!

Problem description

When indexing/selecting/slicing a DataFrame, pandas may return a Series. Both classes support the to_csv method, but the methods have subtle differences. These subtle differences may break your code, or demand a call to to_frame. Perhaps it would be beneficial to conform the Series.to_csv method to the DataFrame.to_csv method?

At least, the following two examples seem relevant (and are the reason I came here):

  • DataFrame.to_csv accepts the keyword parameter path_or_buf. This seems to be completely analogous to the differently named keyword parameter path of Series.to_csv.
  • DataFrame.to_csv accepts the keyword parameter line_terminator. There does not seem to be an analogous keyword parameter of Series.to_csv.

Addendum: Other differences

The following two keyword parameters are implemented for DataFrame.to_csv, but not for Series.to_csv: compression, chunksize. I suppose these are mostly relevant if you have very large dataframes, but I do not see any reason why they should not have analogous methods for Series.

Some of the keyword parameters of DataFrame.to_csv are specific to multi-column data and may therefore not make immediate sense for Series.to_csv. On the other hand, a user may wish to use quoting in a single-column file in order to maintain conformity with other parts of a bigger project. In this relation, the following keyword parameters are implemented for DataFrame.to_csv but do not for Series.to_csv: quoting, quotechar,doublequote, escapechar.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-32-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL:
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

@jorisvandenbossche jorisvandenbossche added the IO CSV read_csv, to_csv label Feb 15, 2018
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 15, 2018

@dahlbaek Thanks for opening this issue. I think this all sounds very sensible, and, should also not be too hard to solve.
As currently Series.to_csv just creates a dataframe and then does DataFrame.to_csv:

pandas/pandas/core/series.py

Lines 2924 to 2932 in 2fdf1e2

df = DataFrame(self)
# result is only a string if no path provided, otherwise None
result = df.to_csv(path, index=index, sep=sep, na_rep=na_rep,
float_format=float_format, header=header,
index_label=index_label, mode=mode,
encoding=encoding, compression=compression,
date_format=date_format, decimal=decimal)
if path is None:
return result

So I think we could simply pass through **kwargs to DataFrame.to_csv

@dahlbaek
Copy link
Contributor Author

I see! If you would like, I would be happy to try my hand at a pull request.

@jorisvandenbossche
Copy link
Member

Sure!

@gfyoung
Copy link
Member

gfyoung commented Feb 15, 2018

xref #18958 : I 100% agree that this should be done.

@gfyoung gfyoung added the Compat pandas objects compatability with Numpy or Python functions label Feb 15, 2018
@jreback jreback added this to the 0.23.0 milestone Feb 18, 2018
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.0, Next Major Release Mar 29, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Jul 13, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Jul 13, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 13, 2018

@jreback @jorisvandenbossche : Three PR's for resolving this issue:

#19745 (@dahlbaek)
#21868 (@gfyoung)
#21896 (@toobaz)

All of them have different implementations. Thoughts?

toobaz added a commit to toobaz/pandas that referenced this issue Jul 13, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Jul 14, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Jul 14, 2018
@dahlbaek
Copy link
Contributor Author

dahlbaek commented Jul 14, 2018

As far as I understand, the solution by @gfyoung (#21868) takes care of everything except the reordering of positional arguments. As pointed out by @toobaz here (see #21896 for proof of concept), the ordering problem can be solved by noting that index should be True or False while sep must be a one-character string. That is, one can determine whether the user is using the old or new signature by checking the type of sep (which is the first positional argument following path_or_buf in the new signature).

However, as of right now, index also accepts one-character strings (evaluating to True), which means there may be code out there passing one-character strings to index. I wonder if it would be possible to pave the way for the solution to this problem by first having a release which forces users to pass in boolean valued arguments to index?

If one were to first implement something like

if not isinstance(index, bool):
    raise ValueError(
        "{0} is not a valid value for index. "
        "Use True, False or bool({0}) instead."
        .format(repr(index))
    )

in Series.to_csv, then the solution of @toobaz would be ensured to correctly classify which signature the user had in mind. But maybe this is overthinking it.

@toobaz
Copy link
Member

toobaz commented Jul 14, 2018

then the solution of @toobaz would be ensured to correctly classify which signature the user had in mind

Doesn't it already? (see last rebase, and my reply to your comment)

@toobaz
Copy link
Member

toobaz commented Jul 14, 2018

Doesn't it already? (see last rebase, and my reply to your comment)

No, it does not, sorry, the ambiguity is unavoidable.

If one were to first implement something like

No, I really don't think we want to do this - not worth the effort (the "y" behavior is undocumented).

toobaz added a commit to toobaz/pandas that referenced this issue Jul 25, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Aug 2, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Aug 2, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv
Projects
None yet
5 participants