Request for DataFrame.to_tsv() for reading tab delimited text #10327

Closed
dhimmel opened this issue Jun 11, 2015 · 17 comments
Labels
IO CSV read_csv, to_csv

Comments

@dhimmel
Contributor

dhimmel commented Jun 11, 2015

I propose a method, callable on a DataFrame, named to_tsv or to_table. It would be the equivalent of to_csv() with the argument sep='\t'. While to_csv() already contains the functionality to write TSV files, I find it annoying to always have to specify an additional argument. I prefer TSV files to CSV files because tabs occur more rarely in data and therefore decrease the need for escaping. I also find the plain-text rendering more readable. I worry that the lack of a dedicated to_tsv() function encourages the use of CSV over TSV. Currently read_table() defaults to tab separators, but there is no equivalent function for writing.
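For readers unfamiliar with the workaround, a minimal sketch of what the proposal would shorten is given here. The example frame and the in-memory buffer are made up for illustration; only `to_csv`, `read_csv`, and the `sep` argument come from pandas itself.

```python
import io

import pandas as pd

# Hypothetical example data, not taken from the issue.
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.9, 0.7]})

# Writing tab-delimited text today: override the default separator.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()

# Reading it back needs the same separator.
roundtrip = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

A dedicated `to_tsv` would presumably remove only the `sep="\t"` argument from the write call.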

@dhimmel
Contributor Author

dhimmel commented Sep 22, 2015

In addition to being just to_csv(sep='\t'), a to_tsv function should consider changing the default quoting, since quoting is less necessary for tsv files.
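To illustrate the quoting point, here is a small sketch comparing the default quoting with `csv.QUOTE_NONE` when writing tab-delimited text. The data is hypothetical, and whether a `to_tsv` should actually default to `QUOTE_NONE` is an open question rather than something settled in this thread.

```python
import csv
import io

import pandas as pd

# Hypothetical data: one field contains double quotes.
df = pd.DataFrame({"id": [1], "text": ['say "hi"']})

# Default quoting (csv.QUOTE_MINIMAL) wraps the field and doubles the quotes.
default_buf = io.StringIO()
df.to_csv(default_buf, sep="\t", index=False)

# QUOTE_NONE writes the field verbatim. Note that a field containing a tab
# would then need an escapechar, otherwise the csv module raises an error.
raw_buf = io.StringIO()
df.to_csv(raw_buf, sep="\t", index=False, quoting=csv.QUOTE_NONE)

default_lines = default_buf.getvalue().splitlines()
raw_lines = raw_buf.getvalue().splitlines()
```

Here `default_lines[1]` holds the quoted form and `raw_lines[1]` the unquoted one, which is the difference a changed default would make.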

@shoyer
Member

shoyer commented Sep 22, 2015

The pandas API is already cluttered with an excess of rarely used convenience methods. I really don't think adding another one is a good idea.

@jorisvandenbossche
Member

I agree with @shoyer here. All functionality is there to do this within to_csv, and given we already have many methods, I think the reason to add a new one should be stronger than being able to provide other defaults.

I am closing this (we have too many open issues ..), but discussion can certainly continue if needed.

@jorisvandenbossche added the IO CSV read_csv, to_csv label Sep 22, 2015
@DSLituiev

+1. As a practitioner, I would highly appreciate to_tab sugar.

@dhimmel
Contributor Author

dhimmel commented May 31, 2016

> I think the reason to add a new one should be stronger than being able to provide other defaults.

IMO convenience is worthy justification (people tend to write many text files, so to_csv has to constantly be supplemented with parameters).

However, my main motivation is a disdain for the CSV format. It pains me to see people still using CSV over TSV. Obviously excel/database support has a role to play. But a project like pandas should strive to make the best practices the easiest to implement.

@DSLituiev

Though this is not a major issue for me currently, CSV is US/Commonwealth representation-centric and internationally unaware. Given the Pythonic philosophy of embracing UTF and internationalization, tab-separated should be preferred over CSV / semicolon-separated values.

@stevekm

stevekm commented Jul 18, 2016

While I can understand the sentiment put forward by @shoyer, I agree with @dhimmel. It is my experience that TSV is much more of a standard format for data analysis than CSV. There are many use cases where the TSV format is a requirement, whereas I am not familiar with any for the CSV format (there are a couple of examples of common usages here). TSV also has the advantage that the raw text is easily readable, and it avoids the issues with quoting mentioned by @dhimmel.

@shoyer
Member

shoyer commented Jul 18, 2016

I am only slightly opposed to adding to_tsv. In my experience (in the US) CSV is more common than TSV (at least as the name for the file format), but only slightly. The main virtue to_tsv has going for it is that the name makes it instantly clear what it does.

@dhimmel
Contributor Author

dhimmel commented Jul 18, 2016

CSV and TSV are both well supported and widely used in data science. CSV is more of a legacy format, thus many backwards-focused projects default to CSV. However, I think forwards-focused projects should default to TSV, as it's better for data science. Since there is no default to_text_delimited_file output function in pandas, to_csv is the de facto default. Since most users don't care enough to manually specify sep='\t', pandas is contributing to the prevalence of CSVs over TSVs and delaying the rise of the superior format.

@Starkiller4011

Please excuse my ignorance on the matter, but apart from being easier to read as a human (if and only if the column headers have roughly the same number of characters as their corresponding data, which is not always the case), what advantages does TSV provide over CSV? Honestly curious if there is a performance difference between the two. I use TSV right now, but only because the data files I am working with came in that format, so I left them that way.

@dhimmel
Contributor Author

dhimmel commented Jun 5, 2017

> what advantages does TSV provide over CSV?

@Starkiller4011 tabs are a more natural separator for columnar data. They require less quoting, since values rarely contain tabs but often contain commas.

> Honestly curious if there is a performance difference between the two

I'd expect the performance difference is trivial. However, like most things in data science, the real type of performance that matters is programmer efficiency. And I think TSVs are nicer to work with than CSVs.
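The claim about quoting can be checked with a short sketch. The frame below is invented for illustration; the only assumption is pandas' default minimal quoting, under which a field is quoted when it contains the separator.

```python
import pandas as pd

# Hypothetical data: a value containing a comma but no tab.
df = pd.DataFrame({"name": ["Smith, John"], "age": [42]})

# With no path given, to_csv returns the text as a string.
csv_text = df.to_csv(index=False)            # comma-separated
tsv_text = df.to_csv(index=False, sep="\t")  # tab-separated

csv_lines = csv_text.splitlines()
tsv_lines = tsv_text.splitlines()
```

In the comma-separated output the name has to be wrapped in quotes, while in the tab-separated output it is written as-is.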

@dsm054
Contributor

dsm054 commented Jun 5, 2017

Not everyone agrees that tab separation is superior to csv -- I don't, for example.

As Python programmers, we know that whitespace isn't always preserved across different operations, like copying and pasting. Those of us who answer a lot of questions on SO, for example, regularly have to use sep="\s\s+" to parse text which people have dumped in a whitespace-separated format, and we have to hope they've put enough spaces between columns for that to work. If they were using commas, or semicolons, or pipes, or something, this wouldn't be a problem. (And I just thought of carets, which used to be used pretty widely in some fields.)
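As a rough illustration of that workaround, the sketch below parses pasted text whose tabs have been turned into runs of spaces. The sample text is invented; `engine="python"` is passed because regular-expression separators are handled by the Python parsing engine.

```python
import io

import pandas as pd

# Hypothetical pasted text: columns are separated by two or more spaces,
# while single spaces appear inside values.
pasted = (
    "name          city      score\n"
    "Ada Lovelace  London    9.5\n"
    "Alan Turing   Wilmslow  9.0\n"
)

# Split on runs of two or more whitespace characters.
df = pd.read_csv(io.StringIO(pasted), sep=r"\s\s+", engine="python")
```

This only works while every column gap is at least two spaces wide and no value contains a double space, which is the fragility being described.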

If we want to add a to_tsv alias to make some people happier, okay. But let's not pretend that TSV doesn't have its own headaches when you're working with it, and the only advantage I can think of is less quoting.

@stevekm

stevekm commented Jun 5, 2017

I think it's worth taking a step back and recognizing that a function like to_csv is kind of silly. The solution should be a more generic to_table function which requires a delimiter to be specified, and which to_csv is just a convenience wrapper around. R has this functionality in its write.table() function, which makes more sense.
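One way to picture that layering is the sketch below. The names `to_table`, `write_csv`, and `write_tsv` are hypothetical and do not exist in pandas; the generic function simply delegates to the existing `DataFrame.to_csv`.

```python
import io
from functools import partial

import pandas as pd


def to_table(df, path_or_buf, *, sep, **kwargs):
    """Hypothetical generic writer: the delimiter is a required keyword."""
    return df.to_csv(path_or_buf, sep=sep, **kwargs)


# Format-specific helpers are thin wrappers that fix the delimiter.
write_csv = partial(to_table, sep=",")
write_tsv = partial(to_table, sep="\t")

# Usage with made-up data and an in-memory buffer.
example = pd.DataFrame({"a": [1], "b": [2]})
out = io.StringIO()
write_tsv(example, out, index=False)
```

Making `sep` keyword-only means a call to `to_table` without a delimiter fails immediately instead of silently falling back to commas.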

@dhimmel
Contributor Author

dhimmel commented Jun 5, 2017

For the record, I think both CSV and TSV are acceptable and good formats. They should both be supported. @dsm054 brings up some compelling advantages of non-whitespace delimiters.

A bigger issue in my opinion is using the .csv extension indiscriminately (e.g. when referring to TSVs). See discussion at #14587. I agree with @stevekm that to_table should be a generic function where you specify your delimiter, while to_csv or to_tsv should focus on those standards. Going about this in a backwards-compatible way would take some forethought. But at least pandas 2 should consider function names along the lines of readr.

@marcora

marcora commented Dec 14, 2018

I am just starting to use pandas dataframes, coming from R + tidyverse/readr, and the first thing that made a negative impression on me is the lack of consistent read/write methods like:

read_csv()/write_csv(): comma separated (CSV) files
read_tsv()/write_tsv(): tab separated files
read_delim()/write_delim(): general delimited files
read_fwf()/write_fwf(): fixed width files
read_table()/write_table(): tabular files where columns are separated by white-space.
read_log()/write_log(): web log files

In 20 years of doing data science in genomics I have never encountered a CSV file; most data exists in TSV (or white-space delimited) format. Having to specify the sep and quoting arguments to df.to_csv() to write a TSV (or white-space delimited) file is inconvenient, to say the least.

Having pd.read_tsv() and df.to_tsv() for tab-delimited files, and pd.read_table() and df.to_table() for white-space delimited files, would be very helpful for people coming to pandas from R.

@dhimmel
Contributor Author

dhimmel commented Jan 29, 2019

As of pandas 0.24, read_table is now deprecated (see #21948 / #21954). Since I've been using read_table as a substitute for the lack of read_tsv, I am now getting many:

FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.

On the plus side, removing read_table does make it more straightforward to add both read_tsv and to_tsv functions. But is the tide turning against convenience functions, as per #18262?

@sukjun-kim

sukjun-kim commented Jul 11, 2023

I also strongly agree with @marcora's comment. In genomics, most data exists in TSV format, so it is inconvenient to have to pass a sep= argument every time read_csv() is used.

Here is a tip for those of you who can't wait for pandas to support this feature: you can define the methods yourself by monkey patching:

from functools import partial, partialmethod
import pandas as pd

pd.read_tsv = partial(pd.read_csv, sep='\t')
pd.DataFrame.to_tsv = partialmethod(pd.DataFrame.to_csv, sep='\t', index=False)

You can then call pd.read_tsv() and df.to_tsv() as if they already existed.

# Load a dataframe from TSV file
df = pd.read_tsv(...)

# Write a dataframe to TSV file
df.to_tsv(...)


9 participants