read_csv should default to index_col = 0 #24468

itko · 2018-12-28T17:15:35Z

Code Sample

df.to_csv(file_path)
df = pd.read_csv(file_path)

Problem description

Currently, the default CSV writing behaviour is to write the index column. The default reading behaviour, however, is to assume there is no index column in the file. This is not intuitive when writing and reading files.

The expected behaviour is that a file written without any index option can be read back without any index option. However, the default writing and reading behaviour results in an extra unnamed column.

One fix is to change the default index_col for read_csv; another is to change the default index boolean for to_csv. The former is probably preferable as it preserves information.
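For illustration, a minimal sketch of the mismatch and of the two workarounds that already exist. It uses an in-memory buffer rather than file_path, and the sample frame is made up for the example.

```python
import io

import pandas as pd

# Hypothetical sample data; any frame with a default index behaves the same way.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default round trip: to_csv writes the index, read_csv treats it as data.
csv_text = df.to_csv()
default_trip = pd.read_csv(io.StringIO(csv_text))
print(list(default_trip.columns))  # the old index comes back as an unnamed column

# Workaround 1: tell read_csv that the first column is the index.
read_fix = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Workaround 2: do not write the index in the first place.
write_fix = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(read_fix.columns), list(write_fix.columns))
```

Either workaround restores the original two columns; the first keeps the index values in the file, the second discards them.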

chris-b1 · 2018-12-28T18:33:19Z

At this point, both of those defaults are unlikely to be changed.

pd.DataFrame.from_csv exists for easier round-tripping, though it is deprecated; see e.g. #10163

Because it is schema-less, csv is never a particularly safe format for round-tripping; consider using something binary like parquet or HDF5 instead.
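To make the schema-less point concrete, here is a small sketch showing type information being lost in a CSV round trip even when the index is left out. The frame and column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical frame: zero-padded string codes and a timestamp column.
df = pd.DataFrame(
    {
        "code": ["001", "002"],
        "when": pd.to_datetime(["2018-12-28", "2018-12-30"]),
    }
)

# Write without the index, then read back with default options.
restored = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(df.dtypes)
print(restored.dtypes)
print(restored["code"].tolist())  # the codes are re-inferred as integers
```

The file stores only text, so read_csv has to guess the types again: the padded codes become integers and the timestamps are no longer a datetime column unless parse_dates is passed.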

gfyoung · 2018-12-30T05:06:06Z

At this point, both of those defaults are unlikely to be changed.

Agreed. Though might entertain this option more in a super-breaking release like 1.0.

JeroenDelcour · 2019-03-22T16:20:36Z

This has been rejected before in #12627 with the following reply:

Has been this way almost since the beginning.

The idea is that .to_csv and .from_csv are inverses

Essentially impossible to change at this point. But to be honest it's actually a sensible default. Indexes are more and more important. If you are not using them you should.

However, .from_csv has been deprecated in favor of .read_csv, which would only be the inverse of .to_csv if this default were changed.

I can only interpret the other argument (similar to @chris-b1's reply) to be "because legacy". I personally consider this to be a poor argument in any case. Defaults have been changed before, is there a specific reason this one shouldn't?

Because it is schema-less, csv is never a particularly safe format for round-tripping; consider using something binary like parquet or HDF5 instead.

Agreed. However, while CSV may not be the best data format for round-tripping, in practice this is one of the most common use-cases. Many new users are confused as to why unnamed columns appear in their files seemingly at random. After years of using Pandas, I still regularly forget to set the right arguments to allow round-tripping. In my opinion, the inconsistency of the current defaults only adds unnecessary cognitive load.
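A short sketch of how the unnamed columns pile up when a file is repeatedly saved and reloaded with the defaults. The frame is a made-up example, and the exact names of the extra columns may differ between pandas versions, so the code only prints them.

```python
import io

import pandas as pd

# Hypothetical starting frame with a single data column.
df = pd.DataFrame({"a": [1, 2]})

# Two save/load cycles with default arguments on both sides.
for _ in range(2):
    df = pd.read_csv(io.StringIO(df.to_csv()))

# Each cycle turns the written index into one more unnamed data column.
print(list(df.columns))
```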

I think this is a good issue for v0.25.

chris-b1 · 2019-03-22T16:39:15Z

just personally, I'd be more sympathetic to changing the default on to_csv to index=False, but that has its own set of problems

gfyoung · 2019-03-22T18:48:43Z

but that has its own set of problems

Agreed. But the point is well taken that we should pick a suitable (and consistent) default for both to avoid the confusions described above. That being said, it would be good to get some more opinions on what people generally use in the wild (i.e. with an index, without) before settling on one.

Defaults have been changed before, is there a specific reason this one shouldn't?

We aren't saying that it shouldn't, but asking for us to do this in 0.25.0 is rushing things IMO.

JeroenDelcour · 2019-03-22T19:01:56Z

asking for us to do this in 0.25.0 is rushing things IMO.

Sorry, I didn't mean to rush it. I'm not very familiar with the Pandas release cycle. As long as it's not pushed back to 1.0.0 - I don't think it's that breaking.

just personally, I'd be more sympathetic to changing the default on to_csv to index=False, but that has its own set of problems

I'm leaning towards this, too, if nothing else because it would match what most users I know already do.

TomAugspurger · 2019-12-11T18:37:46Z

Agreed with #24468 (comment) that this is too large of a change for us at this point. What do you think @itko?

itko · 2019-12-11T23:12:00Z

I tend to agree with @JeroenDelcour. Not only have I seen multiple users get confused by the appearance of unnamed columns, I myself often forget to set the index arguments.

It seems to me we should be prioritizing usability and intuitiveness. Can't we schedule this for release and add a FutureWarning?
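As a rough idea of what such a transition could look like from the user's side, here is a sketch of a wrapper that warns whenever index is left implicit. The function to_csv_with_warning and the sentinel _NOT_PASSED are hypothetical names for this example, not pandas API, and this is not how pandas itself implements deprecations.

```python
import warnings

import pandas as pd

# Sentinel that distinguishes "argument not passed" from an explicit True/False.
_NOT_PASSED = object()


def to_csv_with_warning(df, path_or_buf=None, index=_NOT_PASSED, **kwargs):
    """Hypothetical wrapper: warn when `index` is not given explicitly."""
    if index is _NOT_PASSED:
        warnings.warn(
            "The default of `index` may change in a future release; "
            "pass index=True or index=False explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        index = True  # keep the current behaviour during the transition
    return df.to_csv(path_or_buf, index=index, **kwargs)


# Explicit argument: no warning is raised and the index is omitted.
text = to_csv_with_warning(pd.DataFrame({"a": [1]}), index=False)
print(text)
```

Callers who already pass index explicitly would see no change, while everyone else gets a release cycle of warnings before the default flips.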

gfyoung added the API Design and IO CSV (read_csv, to_csv) labels on Dec 30, 2018

WillAyd mentioned this issue May 20, 2019

to_csv with UTF16 Incorrectly Treats BOM as column #26446

Closed

phofl mentioned this issue Nov 27, 2021

ENH: Different behavior of pandas when saving and restoring from a CSV file #44639

Closed

twoertwein mentioned this issue Apr 1, 2022

change the index in DataFrame.to_csv to False as default? #46583

Closed
