read_csv should default to index_col = 0 #24468
Comments
At this point, both of those defaults are unlikely to be changed.
Because it is schema-less, csv is never a particularly safe format for round-tripping. Consider using something binary like parquet or HDF5 instead.
Agreed. Though might entertain this option more in a super-breaking release like |
This has been rejected before in #12627 with the following reply:
However, I can only interpret the other argument (similar to @chris-b1's reply) to be "because legacy". I personally consider this to be a poor argument in any case. Defaults have been changed before; is there a specific reason this one shouldn't?
Agreed. However, while CSV may not be the best data format for round-tripping, in practice this is one of the most common use-cases. Many new users are confused as to why unnamed columns appear in their files seemingly at random. After years of using Pandas, I still regularly forget to set the right arguments to allow round-tripping. In my opinion, the inconsistency of the current defaults only adds unnecessary cognitive load. I think this is a good issue for v0.25.
just personally, I'd be more sympathetic to changing the default on |
Agreed. But the point is well taken that we should pick a suitable (and consistent) default for both to avoid the confusion described above. That being said, it would be good to get some more opinions on what people generally use in the wild (i.e. with an index or without) before settling on one.
We aren't saying that it shouldn't, but asking for us to do this in |
Sorry, I didn't mean to rush it. I'm not very familiar with the Pandas release cycle. As long as it's not pushed back to
I'm leaning towards this, too, if nothing else because it would match what most users I know already do.
Agreed with #24468 (comment) that this is too large of a change for us at this point. What do you think @itko?
I tend to agree with @JeroenDelcour. Not only have I seen multiple users get confused by the appearance of unnamed columns, I myself often forget to set the index arguments. It seems to me we should be prioritizing usability and intuitiveness. Can't we schedule this for release and add a FutureWarning?
Code Sample
Problem description
Currently, the default CSV writing behaviour is to write the index column. The default reading behaviour, however, is to assume there is no index column in the file. This is not intuitive when writing and reading files.
The expected behaviour is that a file written without any index option can be read back without any index option. However, the default writing and reading behaviour results in an unnamed column.
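The original code sample did not survive on this page, so the following is only a minimal sketch of the behaviour described above, not the reporter's own example. The frame contents and variable names are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical example data; any small frame shows the same effect.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write with the defaults: to_csv writes the index as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# Read with the defaults: read_csv does not treat any column as the index,
# so the written index comes back as an ordinary column.
roundtrip = pd.read_csv(buf)

print(roundtrip.columns.tolist())  # ['Unnamed: 0', 'a', 'b']
```

With both calls left at their defaults, the round-tripped frame has one more column than the original.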
One fix is to change the default index_col for read_csv; another is to change the default index boolean for to_csv. The former is probably preferable as it preserves information.
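Until either default changes, both options can be applied explicitly by the caller. The sketch below uses a made-up frame with string index labels to show what each option keeps; it illustrates the two existing arguments and is not a proposed patch.

```python
import io

import pandas as pd

# Hypothetical frame with a meaningful (non-default) index.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Option 1: keep the default write, and tell read_csv that the first
# column is the index. The index labels survive the round trip.
with_index = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)

# Option 2: drop the index when writing, and keep the default read.
# No unnamed column appears, but the original labels are lost.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(with_index.index.tolist())     # ['x', 'y']
print(without_index.index.tolist())  # [0, 1]
```

Both variants return only the columns a and b; they differ in whether the index labels are kept, which is the sense in which the first option preserves information.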