Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep (un-deprecate) read_table() for API stability #25220

Closed
st-bender opened this issue Feb 7, 2019 · 34 comments · Fixed by #27102
Closed

Keep (un-deprecate) read_table() for API stability #25220

st-bender opened this issue Feb 7, 2019 · 34 comments · Fixed by #27102
Labels
API Design Blocker Blocking issue or pull request for an upcoming release IO Data IO issues that don't fit into a more specific label Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@st-bender
Copy link

st-bender commented Feb 7, 2019

Problem description

read_table() got deprecated in favour of using read_csv().

Using read_csv() to read tab/space delimited files is counter-intuitive. According to the docs and the related issues, both share the same code and it is not clear why the one function should be preferred over the other, and that change may even break existing code.

In my point of view read_table() is the more general function and read_csv() is a special case. Why would you deprecate (and then remove) the more useful function? It is already annoying to use to_csv() to write space/tab delimited files. And as I can see it, it comes down to two lines of code.

Proposed solution

Keep both functions as they are (un-deprecate read_table()) or rename the function to have a more general name like read_txt() (as in numpy.genfromtxt()) or similar.

@jreback
Copy link
Contributor

jreback commented Feb 7, 2019

pls reference the issue where this was deprecated

also you might want to highlite and rebut some arguments there

@st-bender
Copy link
Author

Thanks,
relevant issues are: #18262, #21948 and PR #21954
Tried to add some discussion there, but so far no reaction that's why I opened a new issue.

@jreback
Copy link
Contributor

jreback commented Feb 8, 2019

my point is to summarize the arguments in this issue (for and against)

@gfyoung gfyoung added IO Data IO issues that don't fit into a more specific label API Design Needs Discussion Requires discussion from core team before further action labels Feb 9, 2019
@st-bender
Copy link
Author

Hi,
Thank you for your input and setting the labels and sorry for getting back to you so late.
I am not really sure that I understood correctly, my point is that #18262 only lists a number of API functions to be deprecated, i.e. to be removed in the future, without further reasoning. In my opinion any change to the official and documented API should be justified. After all, pandas calls itself stable.

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

My points pro keeping read_table():

  • First of all to keep the API stable and to not break existing code.
  • Keep the function with the more general name (IMHO) over the special case, i.e. read_table() to read ascii files with any delimiter/separator over read_csv() with the connotation of a fixed delimiter, the comma. Renaming that function to read_txt() may be worthwhile considering, although it contradicts the first point above.
  • Two lines of additional code, please correct me if I am mistaken.

Contra (as far as I can see):

  • Two lines of extra code.

@tpall
Copy link

tpall commented Mar 1, 2019

Using read_csv() to import files with any delimiter seems confusing. Regarding naming, read_delim() (as in R readr package) is good alternative to read_table() or read_txt().

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 6, 2019

Keep the function with the more general name (IMHO) over the special case

Keep in mind that the stdlib provides this under the csv module. Shouldn't we be consistent with the stdlib? If you're coming from a Python backgound, wouldn't you look there first?

Two lines of extra code.

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

@st-bender
Copy link
Author

Keep the function with the more general name (IMHO) over the special case

Keep in mind that the stdlib provides this under the csv module. Shouldn't we be consistent with the stdlib? If you're coming from a Python backgound, wouldn't you look there first?

Good point, I was actually used to numpy's genfromtxt(), so I was looking for something like that. I am not used to csv (files and module) that much. So in my case that's probably why I missed that connection.

Two lines of extra code.

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

Fair enough, and I would agree as a developer. However, you could still refer one to the other and state that they are kept for backwards compatibility. Keep in mind that any part of a stable API that is removed will break code. Not everybody may keep track of changes in the underlying libraries and modules. Imaging someone updates his or her setup at some point in the future for bugfixes and new or improved features. Then he or she notices that one or more of the packages in the setup uses that function (or any other that you are going to remove) which then breaks that package and in turn his own code. Those bugs are nasty to track down, and may lead to people not updating packages or rolling back updates.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 6, 2019 via email

@WillAyd
Copy link
Member

WillAyd commented Mar 6, 2019

I know we've had some discussions around read_table which is fine to keep ongoing, but I would be a HUGE -1 on this. I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Here's the top 5 search results for me for "pandas tutorial" excluding pandas own documentation (this is by no means an endorsement of any of these just sharing as reference):

  1. https://www.dataquest.io/blog/pandas-python-tutorial/
  2. https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
  3. https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/
  4. https://data-flair.training/blogs/pandas-tutorial/
  5. https://www.learnpython.org/en/Pandas_Basics

All use read_csv, none use read_table. Again I understand the merits of cleaning up semantics here, but from a pure usage analysis and effort trade-off I would be very against doing this

@st-bender
Copy link
Author

This is just deprecated right now, and will be for some time, to give the community time to catch up.

Isn't the whole point to remove it in the future? Unless I misunderstood the point of deprecating something. So anybody who missed the deprecation warning for whatever reason will be surprised.

Pandas is still evolving rapidly enough that I'd be pleasantly surprised if read_table being dropped was the hardest part of upgrading between two distant versions.

I understand that, but I am tempted to take that bet. I was actually not talking only about read_table() but also about all those other functions that got deprecated recently.
Apparently we have different opinions on 'Development Status :: 5 - Production/Stable'
For me it means that it's save to use and will be compatible for a long time into the future without breaking things. I am thinking about 10 to 20 years or even more.

@st-bender
Copy link
Author

I know we've had some discussions around read_table which is fine to keep ongoing, but I would be a HUGE -1 on this. I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Here's the top 5 search results for me for "pandas tutorial" excluding pandas own documentation (this is by no means an endorsement of any of these just sharing as reference):

  1. https://www.dataquest.io/blog/pandas-python-tutorial/
  2. https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
  3. https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/
  4. https://data-flair.training/blogs/pandas-tutorial/
  5. https://www.learnpython.org/en/Pandas_Basics

All use read_csv, none use read_table. Again I understand the merits of cleaning up semantics here, but from a pure usage analysis and effort trade-off I would be very against doing this

They use read_csv for reading .csv which is what it is designed for and named after. I am talking about reading space/tab delimited files, for which read_csv() is misnamed and not the first place I would look for (and didn't look). The title was meant to provoke discussion, otherwise you may not have stepped in. ;)
The only issue I actually have is changing the API in a backwards-incompatible way, i.e. removing an actually used function from it, which will break other people's code. Granted it may not be the vast majority but it's not negligible either. Yet on the other hand you introduce new interfaces like read_fmf() which does essentially the same, all three cases are handled by numpy's genfromtxt().

But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 7, 2019 via email

@jimmywan
Copy link

jimmywan commented Mar 7, 2019

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

This makes sense if the aliases are all superfluous and the base function has a reasonable name. You can't reasonably make that argument here.

This is just deprecated right now, and will be for some time, to give the community time to catch up.

So deprecate read_csv and give them a healthy window to catch up. Giving general functions specific names is bad design. If you care at all about having a clean API, the decision to choose read_csv over read_table was absolutely backwards.

I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Arguments like those are akin to saying progress should never be made for the sake of stability. At the very least read_table should be un-deprecated.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 7, 2019 via email

@WillAyd
Copy link
Member

WillAyd commented Mar 7, 2019

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

The title was meant to provoke discussion, otherwise you may not have stepped in. ;)

I understand there are some frustrations with the deprecation of read_table and that conversation can certainly continue. However, if the request to deprecate read_csv is frivolous than I would at least ask that you update the title to reflect what you are actually asking for (ex: "Undeprecate read_table") instead of what it currently is, as otherwise we'll continue down tangential paths that aren't helpful regardless of anyone's viewpoint on read_table.

But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.

Pandas is an open source tool built off of volunteer hours. We would gladly welcome contributions from you to improve the tool. Otherwise I'd just like to call out that incendiary and insulting comments don't contribute anything to the tool or the community surrounding it.

@jimmywan
Copy link

jimmywan commented Mar 7, 2019

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

I understand there are some frustrations with the deprecation of read_table and that conversation can certainly continue. However, if the request to deprecate read_csv is frivolous than I would at least ask that you update the title to reflect what you are actually asking for (ex: "Undeprecate read_table") instead of what it currently is, as otherwise we'll continue down tangential paths that aren't helpful regardless of anyone's viewpoint on read_table.

I'm not the OP, but I think some of us are trying to say that deprecating read_csv in favor of read_table is defensible in terms of API clarity/readability with the side effect of some end-user pain.
Deprecating read_table in favor of read_csv just seems misguided and wrong.

You can't reasonably make that argument here.

Why do you say that? The stdlib uses the "csv" module for reading delimited files.

That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it. No reason to propagate that decision any further.

Sidenote: The API choices by readr from the tidyverse to make read_csv/read_tsv convenience aliases for the more general read_delim ("delimited") seems like a more consistent choice.
https://readr.tidyverse.org/reference/read_delim.html

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 7, 2019 via email

@jimmywan
Copy link

jimmywan commented Mar 8, 2019

That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it.

Do you have a source for that?

I'm not sure that I understand your question or the motivation behind it, but...

The original PEP was written > 15 years ago and it's fairly clear that the specific problem they were trying to solve is handling all of the annoying corner cases involved with a poorly defined file format that is ill conceived. Using a delimiter that shows up all over the place in every day life and real world content is just begging to generate corner cases, and that it has, with all of the crazy rules you have to put in place to deal with commas or single/double quotes. There is a passing reference to trying to make it be something slightly more general in support of Excel and excel-generated content, because that's been a tool of choice of real world users for decades. There's really no mention of trying to specifically "name" it to reflect being a general purpose parser building block or anything of that sort.
https://www.python.org/dev/peps/pep-0305/#id13

As mentioned in the PEP, it was largely based upon a library named csv specifically built to parse CSV files:
http://www.object-craft.com.au/projects/csv/

Maybe we just agree to disagree ¯\(ツ)/¯, but none of that really has anything to do with whether we, the "royal" we ;), in 2019, should be following a naming convention created in 2003 for an entirely different problem.

P.S. huge fan of your work with dask/dask-ml!

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 8, 2019 via email

@st-bender st-bender changed the title Deprecate read_csv() instead of read_table() Keep (un-deprecate) read_table() for API stability Mar 8, 2019
@st-bender
Copy link
Author

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

The title was meant to provoke discussion, otherwise you may not have stepped in. ;)

I understand there are some frustrations with the deprecation of read_table and that conversation can certainly continue. However, if the request to deprecate read_csv is frivolous than I would at least ask that you update the title to reflect what you are actually asking for (ex: "Undeprecate read_table") instead of what it currently is, as otherwise we'll continue down tangential paths that aren't helpful regardless of anyone's viewpoint on read_table.

Did that, is it better now?

But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.

Pandas is an open source tool built off of volunteer hours. We would gladly welcome contributions from you to improve the tool. Otherwise I'd just like to call out that incendiary and insulting comments don't contribute anything to the tool or the community surrounding it.

Sorry, I didn't mean to insult anybody. My problem is that I write code to do my my work and want to make sure that it still works in a few years from now when maybe someone else takes it over, or wants to reproduce my results. I can't do that if the API is constantly changing, or I would have to cap the max versions on the dependencies. So I am looking for solutions that are --- hopefully --- future proof.

@nspies
Copy link
Contributor

nspies commented Mar 13, 2019

Add my vote to keeping read_table around!

I've been a pandas user for 7+ years and have used read_table(...,sep="\t") where necessary. I agree with OP that I far prefer the read_table form as it sounds more general.

I will note that the documentation supports this interpretation: read_table is a function to "Read general delimited file into DataFrame." while read_csv is to "Read a comma-separated values (csv) file into DataFrame."

I have to say in my corner of datasciences, no one has touched the standard csv module, while everyone uses pandas, so I think it should decide for itself what's right. (A better point of comparison is probably R, which has both a read.csv and read.table function, where read.csv is an alias for the more general-purposeread.table.)

I generally agree with minimizing aliases, but this is at the core of what pandas does and is a very widespread alias. While "pd.read_csv" is clearly more commonly used -- I wouldn't suggest getting rid of it! -- a quick unscientific github search yields 50k+ results for "pd.read_table".

@tzuni
Copy link

tzuni commented Apr 2, 2019

In bioinformatics I have to deal with a LOT of tab-delimited data formats and almost never encounter comma-delimited data so this change has made a very prominent impact in my daily use. I too would like to see read_table stick around.

In my view implementing a general case read_delim function would make sense with read_csv and read_table becoming convenience functions to it. This would simplify code maintenance while providing a convenient, more readable, and stable API for two common use-cases.

@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone May 9, 2019
@WillAyd
Copy link
Member

WillAyd commented May 17, 2019

@pandas-dev/pandas-core I know this was talked about during the call yesterday. FWIW I actually like the read_delim base with read_csv and read_table simply wrapping that as described by @jimmywan so I'd be +1 for keeping this around and moving toward that paradigm in the long run

@jorisvandenbossche
Copy link
Member

I just saw this in the enthought cheatsheet passing on Twitter (https://www.enthought.com/wp-content/uploads/Enthought-Python-Pandas-Cheat-Sheets-1-8-v1.0.2.pdf). They use read_table as the main example of an IO method.
And from the analysis in https://github.com/Quansight-Labs/python-api-inspect/ (but don't know which libraries / notebooks they analysed), read_table is after read_csv the most popular read_ method, before things like read_excel and read_hdf.

Anyway, seeing those made me think about this issue. For me it are additional hints that it might be worth actually keeping the read_table name.

@datapythonista
Copy link
Member

Sorry to re-emphasize it here, but I think we'll have an increasing demand to add I/O modules, and I don't think it should be our job to maintain the code to import and export from every possible format out there.

I made my proposal in #26804, which surely is not perfect, but I think we should find a way for third-party developers to implement their own connectors and let them integrate with pandas in the best way (for the developers of the libraries, and for the users, creating consistent APIs for both).

@jorisvandenbossche
Copy link
Member

That's for sure a good discussion to have, but I think not very related to the question: do we keep read_table being available by default in the top-level namespace?

@datapythonista
Copy link
Member

Sorry, you're right, I wouldn't keep read_table.

What I meant with my previous comment is that I'd let users have read_table as a third-party module, and that I'd facilitate that happening (since we'd have the read_table code in pandas, it's not a great example of proper IO module, but more an alias, but still think I think make sense).

@jorisvandenbossche
Copy link
Member

I would say that is an argument to keep it now (for 0.25.0), so that we could afterwards say: "if you want to keep using it, install this package".

@datapythonista
Copy link
Member

I'm ok with that too. But since it's already deprecated, I'd probably keep it deprecated (I don't think the discussion is about removing it, or is it?)

@st-bender
Copy link
Author

Hi,

Sorry, you're right, I wouldn't keep read_table.

Thanks for your opinion, but I think it would be appreciated if you could elaborate on why you think that the arguments in this issue in favour of keeping it are not relevant. In my case I would rather say "I wouldn't keep read_csv()". But this has been already discussed above, and a solution was proposed to keep both people happy.

@datapythonista
Copy link
Member

it would be appreciated if you could elaborate on why you think that the arguments in this issue in favour of keeping it are not relevant

Sorry if it wasn't clear for you. read_table has already been deprecated. As pandas maintainers we'll be happy to not maintain two function that do the same. And users will appreciate not having to learn about two functions that do the same. read_table or read_txt may be more appropriate names, I can agree on that. But there is no way we're deprecating read_csv, is by far the most used function in pandas (based on views of the documentation pages).

The main point of my previous comment, is that we should make it easy for people like you to implement your own I/O connectors, and use them as if they were provided by pandas. So, in this case you'll be able to maintain a third-party package, providing read_table / to_table functionality, keep your code base as it is, by just installing and importing a module. And we won't have maintain code we don't want to maintain. I think that solution the best for everybody, besides the (in my opinion) all the other benefits detailed in #26804.

@st-bender
Copy link
Author

st-bender commented Jun 26, 2019

it would be appreciated if you could elaborate on why you think that the arguments in this issue in favour of keeping it are not relevant

Sorry if it wasn't clear for you. read_table has already been deprecated. As pandas maintainers we'll be happy to not maintain two function that do the same.

I know that, that's why I opened this issue because I think it is a step into the wrong direction. And if you followed the discussion, my opinion may not reflect the majority, but I am not alone either.

And users will appreciate not having to learn about two functions that do the same. read_table or read_txt may be more appropriate names, I can agree on that.

As I user I would appreciate a stable API, please stick to the discussion here, all the points have already been raised.

But there is no way we're deprecating read_csv, is by far the most used function in pandas (based on views of the documentation pages).

See above, please read the discussion in the current issue carefully.

The main point of my previous comment, is that we should make it easy for people like you to implement your own I/O connectors, and use them as if they were provided by pandas.

If I have to do that, then pandas is useless (for me). Its IO facility and quite intuitive API was the main reason for me to use it at all. If you are going to remove functionality from the main API, I believe that will only hurt the people using pandas. And deprecation as I understand it, is a warning that a function will be removed in future versions.

So, in this case you'll be able to maintain a third-party package, providing read_table / to_table functionality, keep your code base as it is, by just installing and importing a module. And we won't have maintain code we don't want to maintain. I think that solution the best for everybody, besides the (in my opinion) all the other benefits detailed in #26804.

Thank you, but I have no interest in that. As I user, I want to use a module, not program around functions that have been removed for no good reason The linked issue is an implementation issue, not API related, and therefore separate from the current issue. However you implement the details is of no concern to the user.

Edit: I realize that the above sounds like a demand without any contribution, please don't take it personally, I appreciate all the work from the maintainers. I just think that keeping an API stable will have more benefits in the long run.

@jreback jreback added the Blocker Blocking issue or pull request for an upcoming release label Jun 28, 2019
@jorisvandenbossche
Copy link
Member

We decided at the sprint here to keep it for now, so to remove the deprecation. PR at #27102

@jondo
Copy link

jondo commented Sep 18, 2019

I just want to note that CSV is also very often used in the more general meaning "character-separated values". Maybe the documentation of read_csvshould switch to that definition?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Blocker Blocking issue or pull request for an upcoming release IO Data IO issues that don't fit into a more specific label Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.