Request for DataFrame.to_tsv() for reading tab delimited text #10327

dhimmel · 2015-06-11T00:22:03Z

I propose a function, which can be called on a DataFrame, named to_tsv or to_table. The function is the equivalent of to_csv() with the argument sep='\t'. While to_csv() contains the functionality to write tsv files, I find it annoying to always have to specify an additional argument. I prefer tsv files to csv files because tabs occur more rarely in data and therefore decrease the need for escaping. I also find the plain-text rendering more readable. I worry that the lack of a dedicated to_tsv() function encourages the use of csv over tsv. Currently read_table() defaults to tab separators, but there is no equivalent function for writing.
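For reference, a minimal sketch of the existing spelling that the proposed to_tsv would abbreviate. The sample frame and its column names are made up for illustration; only to_csv and its sep and index parameters come from pandas.

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not from the issue.
df = pd.DataFrame({"gene": ["BRCA1", "TP53"], "score": [0.5, 1.25]})

# With no path given, to_csv returns the text instead of writing a file.
# sep="\t" is the extra argument the proposal wants to avoid typing.
tsv_text = df.to_csv(sep="\t", index=False)
print(tsv_text)
```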


dhimmel · 2015-09-22T01:46:23Z

In addition to being just to_csv(sep='\t'), a to_tsv function should consider changing the default quoting, since quoting is less necessary for tsv files.

shoyer · 2015-09-22T07:50:05Z

The pandas API is already cluttered with an excess of rarely used convenience methods. I really don't think adding another one is a good idea.

jorisvandenbossche · 2015-09-22T08:52:02Z

I agree with @shoyer here. All functionality is there to do this within to_csv, and given we already have many methods, I think the reason to add a new one should be stronger than being able to provide other defaults.

I am closing this (we have too many open issues ..), but discussion can certainly continue if needed.

DSLituiev · 2016-05-29T03:13:01Z

+1. As a practitioner, I would highly appreciate to_tab sugar.

dhimmel · 2016-05-31T18:47:17Z

> I think the reason to add a new one should be stronger than being able to provide other defaults.

IMO convenience is worthy justification (people tend to write many text files, so to_csv has to constantly be supplemented with parameters).

However, my main motivation is a disdain for the CSV format. It pains me to see people still using CSV over TSV. Obviously excel/database support has a role to play. But a project like pandas should strive to make the best practices the easiest to implement.

DSLituiev · 2016-05-31T18:57:52Z

Though this is not a major issue for me currently, CSV is centered on the US/Commonwealth number representation and is internationally unaware. Given the Pythonic philosophy of embracing UTF and internationalization, tab-separated values should be preferred over CSV and semicolon-separated values.

stevekm · 2016-07-18T15:10:11Z

While I can understand the sentiment put forward by @shoyer, I agree with @dhimmel. It is my experience that TSV is much more of a standard format for data analysis than CSV. There are many use cases where the TSV format is a requirement, whereas I am not familiar with any for the CSV format (there are a couple of examples of common usages here). TSV also has an advantage in that the raw text is easily readable, and it avoids the issues with quoting mentioned by @dhimmel.

shoyer · 2016-07-18T16:06:29Z

I am only slightly opposed to adding to_tsv. In my experience (in the US) CSV is more common than TSV (at least as the name for the file format), but only slightly. The main virtue to_tsv has going for it is that the name makes it instantly clear what it does.

dhimmel · 2016-07-18T16:16:23Z

CSV and TSV are both well supported and widely used in data science. CSV is more of a legacy format, thus many backwards-focused projects default to CSV. However, I think forwards-focused projects should default to TSV, as it's better for data science. Since there is no default to_text_delimited_file output function in pandas, to_csv is the de facto default. Since most users don't care enough to manually specify sep='\t', pandas is contributing to the prevalence of CSVs over TSVs and delaying the rise of the superior format.

Starkiller4011 · 2017-06-05T14:53:38Z

Please excuse my ignorance on the matter but apart from being easier to read as a human, if and only if the column headers have roughly the same characters as their corresponding data which is not always the case, what advantages does TSV provide over CSV? Honestly curious if there is a performance difference between the two, I use TSV right now but honestly only because the data files I am working with came in that format so I left them in the same format.

dhimmel · 2017-06-05T15:17:49Z

> what advantages does TSV provide over CSV?

@Starkiller4011 tabs are a more natural separator for columnar data. They require less quoting, since values rarely contain tabs but often contain commas.

> Honestly curious if there is a performance difference between the two

I'd expect the performance difference is trivial. However, like most things in data science, the real type of performance that matters is programmer efficiency. And I think TSVs are nicer to work with than CSVs.
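To illustrate the quoting point made in this comment, here is a small sketch comparing the default output of to_csv with and without sep='\t' for values that contain commas. The names and cities are invented sample data.

```python
import pandas as pd

# Invented sample data: the "name" values contain commas but no tabs.
df = pd.DataFrame({"name": ["Smith, John", "Lee, Ann"], "city": ["Paris", "Oslo"]})

# Comma-separated: fields containing the delimiter are wrapped in quotes.
csv_text = df.to_csv(index=False)
print(csv_text)

# Tab-separated: the same fields need no quoting.
tsv_text = df.to_csv(sep="\t", index=False)
print(tsv_text)
```

In the comma-separated output each name is wrapped in double quotes, while the tab-separated output leaves the values as they are.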

dsm054 · 2017-06-05T15:18:38Z

Not everyone agrees that tab separation is superior to csv -- I don't, for example.

As Python programmers, we know that whitespace isn't always preserved across different operations, like copying and pasting. Those of us who answer a lot of questions on SO, for example, regularly have to use sep="\s\s+" to parse text which people have dumped in a whitespace-separated format, and we have to hope they've put enough spaces between columns for that to work. If they were using commas, or semicolons, or pipes, or something, this wouldn't be a problem. (And I just thought of carets, which used to be used pretty widely in some fields.)
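A rough sketch of the workaround described above, assuming the tabs in a pasted table have turned into runs of spaces. The sample text is invented; the regex separator is the one quoted in the comment, and engine="python" is passed because regex separators are handled by the Python parser.

```python
import io
import pandas as pd

# Invented example of pasted text: columns are separated by two or more
# spaces, while single spaces occur inside values.
lines = [
    "name        city",
    "John Smith  New York",
    "Ann Lee     Paris",
]
text = "\n".join(lines) + "\n"

# Split on runs of two or more whitespace characters.
df = pd.read_csv(io.StringIO(text), sep=r"\s\s+", engine="python")
print(df)
```

This only works as long as every column gap is at least two characters wide, which is the fragility the comment points out.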

If we want to add a to_tsv alias to make some people happier, okay. But let's not pretend that TSV doesn't have its own headaches when you're working with it, and the only advantage I can think of is less quoting.

stevekm · 2017-06-05T15:26:59Z

I think it's worth taking a step back and recognizing that a function like to_csv is kind of silly. The solution should be a more generic to_table function which requires a delimiter to be specified, and which to_csv is just a convenience wrapper around. R has this functionality in its write.table() function, which makes more sense.
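One way to picture this layering is the sketch below. The functions to_table and to_tsv here are hypothetical helpers written for illustration, not pandas API; they simply delegate to the existing DataFrame.to_csv.

```python
import pandas as pd


def to_table(df, path_or_buf=None, *, sep, **kwargs):
    """Hypothetical generic writer: the delimiter must be passed explicitly."""
    return df.to_csv(path_or_buf=path_or_buf, sep=sep, **kwargs)


def to_tsv(df, path_or_buf=None, **kwargs):
    """Hypothetical convenience wrapper that fixes the delimiter to a tab."""
    return to_table(df, path_or_buf, sep="\t", **kwargs)


df = pd.DataFrame({"a": [1], "b": [2]})
print(to_tsv(df, index=False))
print(to_table(df, sep="|", index=False))
```

Because sep is keyword-only and has no default, calling to_table(df) without it raises a TypeError, which is the "requires a delimiter" behavior described above.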

dhimmel · 2017-06-05T15:37:16Z

For the record, I think both CSV and TSV are acceptable and good formats. They should both be supported. @dsm054 brings up some compelling advantages to non-whitespace delimiters.

A bigger issue in my opinion is using the .csv extension indiscriminately (e.g. when referring to TSVs). See discussion at #14587. I agree with @stevekm that to_table should be a generic function where you should specify your delimiter, while to_csv or to_tsv should focus on those standards. Going about this in a backwards compatible way would take some forethought. But at least pandas 2 should consider function names along the lines of readr.

marcora · 2018-12-14T17:45:57Z

Just starting to use pandas dataframes coming from R + tidyverse/readr and first thing I was negatively impressed by is the lack of consistent read/write methods like:

read_csv()/write_csv(): comma separated (CSV) files
read_tsv()/write_tsv(): tab separated files
read_delim()/write_delim(): general delimited files
read_fwf()/write_fwf(): fixed width files
read_table()/write_table(): tabular files where columns are separated by white-space.
read_log()/write_log(): web log files

In 20 years doing data science in genomics I never encountered a csv file; most data exists in tsv (or white-space delimited) format. Having to specify the sep and quoting arguments using df.to_csv() to write a tsv (or white-space delimited) file is inconvenient to say the least.

Having df.read_tsv() df.to_tsv() for tab-delimited files and df.read_table() df.to_table() for white-space delimited files would be very helpful for people coming to pandas from R.

dhimmel · 2019-01-29T20:32:45Z

As of pandas 0.24, read_table is now deprecated (see #21948 / #21954). Since I've been using read_table as a substitute for the lack of read_tsv, I am now getting many:

FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.

On the plus side, removing read_table does make it more straightforward to add both read_tsv and to_tsv functions, although is the tide turning against convenience functions as per #18262?
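The replacement suggested by the warning can be sketched as follows. The in-memory sample text stands in for a real file and is invented for illustration.

```python
import io
import pandas as pd

# Invented tab-separated text standing in for a file on disk.
tsv_text = "gene\tscore\nBRCA1\t0.5\nTP53\t1.25\n"

# read_csv with an explicit tab separator, as the FutureWarning recommends.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df)
```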

sukjun-kim · 2023-07-11T16:27:10Z

I also strongly agree with @marcora's comment. In genomics, since most data exist in TSV format, there is an inconvenience of having to use a sep= argument whenever using the read_csv() function.

Here is a tip for those of you who can't wait for pandas to support this feature: you can simply add your own methods by monkey patching:

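from functools import partial, partialmethod
import pandas as pd

# Module-level reader: read_csv with the tab separator preset
pd.read_tsv = partial(pd.read_csv, sep='\t')
# DataFrame method: to_csv with the tab separator preset and no index column
pd.DataFrame.to_tsv = partialmethod(pd.DataFrame.to_csv, sep='\t', index=False)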

And you can call pd.read_tsv() and df.to_tsv() functions as if they already existed.

# Load a dataframe from TSV file
df = pd.read_tsv(...)

# Write a dataframe to TSV file
df.to_tsv(...)
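As a self-contained sketch of how the patched names behave, the round trip below uses an in-memory buffer instead of a file. The sample data is invented, and read_tsv and to_tsv exist only because of the monkey patch from this comment, not because pandas provides them.

```python
import io
from functools import partial, partialmethod

import pandas as pd

# Monkey patch from the comment above; these names are not part of pandas.
pd.read_tsv = partial(pd.read_csv, sep="\t")
pd.DataFrame.to_tsv = partialmethod(pd.DataFrame.to_csv, sep="\t", index=False)

# Invented sample data.
df = pd.DataFrame({"gene": ["BRCA1", "TP53"], "score": [0.5, 1.25]})

# Write to a string, then read it back through the patched reader.
tsv_text = df.to_tsv()
restored = pd.read_tsv(io.StringIO(tsv_text))
print(restored)

# Preset keyword arguments can still be overridden per call.
with_index = df.to_tsv(index=True)
print(with_index)
```

Note that the patch changes pandas globally for the running process, so it is best kept in a single shared module rather than repeated across scripts.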

dhimmel mentioned this issue Jun 18, 2015

Identifying the subset of human relevant terms obophenotype/uberon#703

Closed

jorisvandenbossche closed this as completed Sep 22, 2015

jorisvandenbossche added the IO CSV read_csv, to_csv label Sep 22, 2015
