read_csv should default to index_col = 0 #24468

Open
itko opened this issue Dec 28, 2018 · 8 comments
Labels
Deprecate Functionality to remove in pandas IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

@itko

itko commented Dec 28, 2018

Code Sample

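df.to_csv(file_path)         # default index=True: the index is written as the first column
df = pd.read_csv(file_path)  # default index_col=None: that column is read back as data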

Problem description

Currently, the default CSV writing behaviour is to write the index column. The default reading behaviour, however, is to assume there is no index column in the file. This is not intuitive when writing and reading files.

The expected behaviour is that a file written without any index option can be read back without any index option. However, the default writing and reading behaviour results in an extra unnamed column in the DataFrame that is read back.
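A minimal sketch of the round trip described above, assuming a small made-up DataFrame and an in-memory buffer in place of `file_path`:

```python
import io

import pandas as pd

# Illustrative data only; any frame with a default index behaves the same way.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write with default arguments: the index goes out as a first column
# with an empty header.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Read with default arguments: the former index comes back as a data column.
restored = pd.read_csv(buffer)

print(list(df.columns))        # ['a', 'b']
print(list(restored.columns))  # ['Unnamed: 0', 'a', 'b']
```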

One fix is to change the default index_col for read_csv; another is to change the default index boolean for to_csv. The former is probably preferable, as it preserves information.
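Until either default changes, both options can be spelled out explicitly today. The sketch below uses the existing `index_col` and `index` parameters; the sample data and buffer names are made up for illustration:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Option 1: keep the index in the file and tell read_csv where to find it.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
kept = pd.read_csv(with_index, index_col=0)

# Option 2: leave the index out of the file when writing.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
dropped = pd.read_csv(without_index)

# Either way, no unnamed column appears.
print(list(kept.columns))     # ['a', 'b']
print(list(dropped.columns))  # ['a', 'b']
```

Option 1 preserves the index values in the file, while Option 2 discards them, which only matters when the index carries information.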

@chris-b1
Contributor

At this point, both of those defaults are unlikely to be changed.

pd.DataFrame.from_csv exists for easier round-tripping, though it is deprecated; see e.g. #10163.

Because it is schema-less, CSV is never a particularly safe format for round-tripping; consider using something binary like Parquet or HDF5 instead.
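To illustrate what "schema-less" means in practice, here is a small sketch (the column name `when` and the dates are made up) showing that a datetime column does not survive a default CSV round trip unless the reader is told to parse it:

```python
import io

import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2018-12-28", "2018-12-30"])})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: the file carries no type information, so the dates stay text.
buffer.seek(0)
restored = pd.read_csv(buffer)

# The caller has to re-supply the schema by hand.
buffer.seek(0)
reparsed = pd.read_csv(buffer, parse_dates=["when"])

print(pd.api.types.is_datetime64_any_dtype(df["when"]))        # True
print(pd.api.types.is_datetime64_any_dtype(restored["when"]))  # False
print(pd.api.types.is_datetime64_any_dtype(reparsed["when"]))  # True
```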

@gfyoung gfyoung added API Design IO CSV read_csv, to_csv labels Dec 30, 2018
@gfyoung
Member

gfyoung commented Dec 30, 2018

> At this point, both of those defaults are unlikely to be changed.

Agreed, though we might entertain this option more in a super-breaking release like 1.0.

@JeroenDelcour

This has been rejected before in #12627 with the following reply:

> Has been this way almost since the beginning.
>
> The idea is that .to_csv and .from_csv are inverses
>
> Essentially impossible to change at this point. But to be honest it's actually a sensible default. Indexes are more and more important. If you are not using them you should.

However, .from_csv has been deprecated in favor of .read_csv, which would only be the inverse of .to_csv if this default were changed.

I can only interpret the other argument (similar to @chris-b1's reply) to be "because legacy". I personally consider this to be a poor argument in any case. Defaults have been changed before, is there a specific reason this one shouldn't?

> Because it is schema-less, CSV is never a particularly safe format for round-tripping; consider using something binary like Parquet or HDF5 instead.

Agreed. However, while CSV may not be the best data format for round-tripping, in practice this is one of the most common use-cases. Many new users are confused as to why unnamed columns appear in their files seemingly at random. After years of using Pandas, I still regularly forget to set the right arguments to allow round-tripping. In my opinion, the inconsistency of the current defaults only adds unnecessary cognitive load.

I think this is a good issue for v0.25.

@chris-b1
Contributor

just personally, I'd be more sympathetic to changing the default on to_csv to index=False, but that has its own set of problems

@gfyoung
Member

gfyoung commented Mar 22, 2019

> but that has its own set of problems

Agreed. But the point is well taken that we should pick a suitable (and consistent) default for both to avoid the confusions described above. That being said, it would be good to get some more opinions on what people generally use in the wild (i.e. with an index, without) before settling on one.

> Defaults have been changed before, is there a specific reason this one shouldn't?

We aren't saying that it shouldn't, but asking for us to do this in 0.25.0 is rushing things IMO.

@JeroenDelcour

> asking for us to do this in 0.25.0 is rushing things IMO.

Sorry, I didn't mean to rush it. I'm not very familiar with the Pandas release cycle. As long as it's not pushed back to 1.0.0 - I don't think it's that breaking.

> just personally, I'd be more sympathetic to changing the default on to_csv to index=False, but that has its own set of problems

I'm leaning towards this, too, if nothing else because it would match what most users I know already do.

@TomAugspurger
Contributor

Agreed with #24468 (comment) that this is too large of a change for us at this point. What do you think @itko?

@itko
Author

itko commented Dec 11, 2019

I tend to agree with @JeroenDelcour. Not only have I seen multiple users get confused by the appearance of unnamed columns, I myself often forget to set the index arguments.

It seems to me we should be prioritizing usability and intuitiveness. Can't we schedule this for release and add a FutureWarning?
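As a rough sketch of what such a deprecation path could look like, the wrapper below warns whenever `index_col` is not passed explicitly. `read_csv_with_warning` and the warning text are hypothetical and not part of pandas; only `pd.read_csv` and the standard `warnings` module are real:

```python
import io
import warnings

import pandas as pd

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit None


def read_csv_with_warning(source, index_col=_UNSET, **kwargs):
    """Hypothetical wrapper around pd.read_csv, for illustration only."""
    if index_col is _UNSET:
        warnings.warn(
            "The default of index_col may change in a future version; "
            "pass index_col explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        index_col = None  # keep today's behaviour during the transition
    return pd.read_csv(source, index_col=index_col, **kwargs)


sample = ",a\n0,1\n1,2\n"  # what to_csv writes by default for a one-column frame
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = read_csv_with_warning(io.StringIO(sample))               # warns
    explicit = read_csv_with_warning(io.StringIO(sample), index_col=0)  # silent
```

Callers who already pass `index_col` would see no change, while everyone else gets a release cycle of notice before the default flips.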

@mroeschke mroeschke added Enhancement Warnings Warnings that appear or should be added to pandas Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action and removed API Design Enhancement Warnings Warnings that appear or should be added to pandas labels Jun 25, 2021
6 participants