Keep (un-deprecate) `read_table()` for API stability #25220
Comments
Please reference the issue where this was deprecated; you might also want to highlight and rebut some of the arguments made there.
My point is to summarize the arguments in this issue (for and against).
Hi,
Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.
My points pro keeping `read_table()`:
- Keep the function with the more general name (IMHO) over the special case.
Contra (as far as I can see):
- Two lines of extra code.
Using […]
Keep in mind that the stdlib provides this under the `csv` module. Shouldn't we be consistent with the stdlib? If you're coming from a Python background, wouldn't you look there first?
All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.
> Keep in mind that the stdlib provides this under the `csv` module. Shouldn't we be consistent with the stdlib? If you're coming from a Python background, wouldn't you look there first?

Good point, I was actually used to numpy's `genfromtxt()`, so I was looking for something like that. I am not used to csv (files and module) that much. So in my case that's probably why I missed that connection.

> All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

Fair enough, and I would agree as a developer. However, you could still refer one to the other and state that they are kept for backwards compatibility. Keep in mind that any part of a *stable* API that is removed will break code. Not everybody may keep track of changes in the underlying libraries and modules. Imagine someone updates his or her setup at some point in the future for bugfixes and new or improved features. Then he or she notices that one or more of the packages in the setup uses that function (or any other that you are going to remove), which then breaks that package and in turn his own code. Those bugs are nasty to track down, and may lead to people not updating packages or rolling back updates.
This is just deprecated right now, and will be for some time, to give the
community time to catch up.
Pandas is still evolving rapidly enough that I'd be pleasantly surprised if
`read_table` being dropped was
the hardest part of upgrading between two distant versions.
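As a side note on the stdlib point above: the standard library's `csv` module reads any single-character delimiter, not just commas, which is the consistency argument being made. A minimal sketch, with made-up sample data used purely for illustration:

```python
import csv
import io

import pandas as pd

# A small tab-delimited sample standing in for a real file.
tsv_text = "name\tvalue\nfoo\t1\nbar\t2\n"

# Standard library: the module is called csv, but the delimiter is configurable.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(rows)  # [['name', 'value'], ['foo', '1'], ['bar', '2']]

# pandas: read_csv handles the same data once sep is given explicitly.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df)
```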
I know we've had some discussions around `read_table` which is fine to keep ongoing, but I would be a HUGE -1 on this. I would be inclined to believe that the VAST majority of users use `read_csv` instead of `read_table`, so doing this would be a nightmare for the sake of semantics.
Here are the top 5 search results for me for "pandas tutorial", excluding pandas' own documentation (this is by no means an endorsement of any of these, just sharing as reference):
1. https://www.dataquest.io/blog/pandas-python-tutorial/
2. https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
3. https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/
4. https://data-flair.training/blogs/pandas-tutorial/
5. https://www.learnpython.org/en/Pandas_Basics
All use `read_csv`, none use `read_table`. Again, I understand the merits of cleaning up semantics here, but from a pure usage analysis and effort trade-off, I would be very against doing this.
> This is just deprecated right now, and will be for some time, to give the community time to catch up.

Isn't the whole point to remove it in the future? Unless I misunderstood the point of deprecating something. So anybody who missed the deprecation warning for whatever reason will be surprised.

> Pandas is still evolving rapidly enough that I'd be pleasantly surprised if `read_table` being dropped was the hardest part of upgrading between two distant versions.

I understand that, but I am tempted to take that bet. I was actually not talking only about `read_table()` […]
They use `read_csv` for reading .csv, which is what it is designed for and named after. I am talking about reading space/tab delimited files, for which `read_csv()` is misnamed and not the first place I would look (and didn't look). The title was meant to provoke discussion, otherwise you may not have stepped in. ;)
The only issue I actually have is changing the API in a backwards-incompatible way, i.e. removing an actually used function from it, which will break other people's code. Granted, it may not be the vast majority, but it's not negligible either. Yet on the other hand you introduce new interfaces like `read_fwf()` which does essentially the same; all three cases are handled by numpy's `genfromtxt()`.
But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.
Thanks for the feedback. I think we have different opinions on this, which
is fine.
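On the space/tab-delimited case discussed above, both libraries already cover it without a dedicated function. A short sketch, with made-up whitespace-delimited sample data:

```python
import io

import numpy as np
import pandas as pd

# Whitespace-delimited sample data (made up for illustration).
txt = "x  y    z\n1  2.5  3\n4  5.0  6\n"

# numpy: genfromtxt splits on runs of whitespace by default.
arr = np.genfromtxt(io.StringIO(txt), names=True)
print(arr["y"])  # [2.5 5. ]

# pandas: read_csv covers the same case with a regex separator,
# which is what read_table users typically reach for.
df = pd.read_csv(io.StringIO(txt), sep=r"\s+")
print(df["y"])
```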
> All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

This makes sense if the aliases are all superfluous and the base function has a reasonable name. You can't reasonably make that argument here.

> This is just deprecated right now, and will be for some time, to give the community time to catch up.

So deprecate `read_csv` and give them a healthy window to catch up. Giving general functions specific names is bad design. If you care at all about having a clean API, the decision to choose `read_csv` over `read_table` was absolutely backwards.

> I would be inclined to believe that the VAST majority of users use `read_csv` instead of `read_table` so doing this would be a nightmare for the sake of semantics.

Arguments like those are akin to saying progress should never be made for the sake of stability. At the very least `read_table` should be un-deprecated.
> You can't reasonably make that argument here.

Why do you say that? The stdlib uses the "csv" module for reading delimited files.
I understand there are some frustrations with the deprecation of `read_table` […].
Pandas is an open source tool built off of volunteer hours. We would gladly welcome contributions from you to improve the tool. Otherwise I'd just like to call out that incendiary and insulting comments don't contribute anything to the tool or the community surrounding it.
I'm not the OP, but I think some of us are trying to say that deprecating `read_table` […].
That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it. No reason to propagate that decision any further.
Sidenote: The API choices by readr from the tidyverse to make […]
> That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it.

Do you have a source for that?
I'm not sure that I understand your question or the motivation behind it, but...
The original PEP was written more than 15 years ago, and it's fairly clear that the specific problem they were trying to solve is handling all of the annoying corner cases involved with a poorly defined file format that is ill conceived. Using a delimiter that shows up all over the place in everyday life and real-world content is just begging to generate corner cases, and that it has, with all of the crazy rules you have to put in place to deal with commas or single/double quotes. There is a passing reference to trying to make it be something slightly more general in support of Excel and Excel-generated content, because that's been a tool of choice of real-world users for decades. There's really no mention of trying to specifically "name" it to reflect being a general purpose parser building block or anything of that sort.
https://www.python.org/dev/peps/pep-0305/#id13
As mentioned in the PEP, it was largely based upon a library named `csv` specifically built to parse CSV files:
http://www.object-craft.com.au/projects/csv/
Maybe we just agree to disagree ¯\\_(ツ)_/¯, but none of that really has anything to do with whether we, the "royal" we ;), in 2019, should be following a naming convention created in 2003 for an entirely different problem.
P.S. huge fan of your work with dask/dask-ml!
Thanks for linking to that PEP. I hadn’t seen it before.
Issue title changed from "… `read_csv()` instead of `read_table()`" to "Keep (un-deprecate) `read_table()` for API stability".
Did that, is it better now?
Sorry, I didn't mean to insult anybody. My problem is that I write code to do my work and want to make sure that it still works a few years from now, when maybe someone else takes it over or wants to reproduce my results. I can't do that if the API is constantly changing, or I would have to cap the max versions on the dependencies. So I am looking for solutions that are --- hopefully --- future proof.
Add my vote to keeping `read_table` […]. I've been a pandas user for 7+ years and have used […]. I will note that the documentation supports this interpretation: […]
I have to say that in my corner of the data sciences, no one has touched the standard csv module, while everyone uses pandas, so I think it should decide for itself what's right. (A better point of comparison is probably R, which has both a `read.csv` and a more general `read.table`.)
I generally agree with minimizing aliases, but this is at the core of what pandas does and is a very widespread alias. While `pd.read_csv` is clearly more commonly used -- I wouldn't suggest getting rid of it! -- a quick unscientific GitHub search yields 50k+ results for `pd.read_table`.
In bioinformatics I have to deal with a LOT of tab-delimited data formats and almost never encounter comma-delimited data, so this change has made a very prominent impact in my daily use. I too would like to see `read_table` […].
In my view, implementing a general case […]
@pandas-dev/pandas-core I know this was talked about during the call yesterday. FWIW I actually like the […]
I just saw this in the Enthought cheat sheet passing by on Twitter (https://www.enthought.com/wp-content/uploads/Enthought-Python-Pandas-Cheat-Sheets-1-8-v1.0.2.pdf). They use `read_table` […].
Anyway, seeing those made me think about this issue. For me these are additional hints that it might be worth actually keeping the `read_table` […]
Sorry to re-emphasize it here, but I think we'll have an increasing demand to add I/O modules, and I don't think it should be our job to maintain the code to import and export from every possible format out there. I made my proposal in #26804, which surely is not perfect, but I think we should find a way for third-party developers to implement their own connectors and let them integrate with pandas in the best way (for the developers of the libraries, and for the users, creating consistent APIs for both).
That's for sure a good discussion to have, but I think it is not very related to the question here: do we keep `read_table` or not?
Sorry, you're right, I wouldn't keep `read_table`. What I meant with my previous comment is that I'd let users have `read_table` […]
I would say that is an argument to keep it now (for 0.25.0), so that we could afterwards say: "if you want to keep using it, install this package".
I'm ok with that too. But since it's already deprecated, I'd probably keep it deprecated (I don't think the discussion is about removing it, or is it?)
Hi,
Thanks for your opinion, but I think it would be appreciated if you could elaborate on why you think that the arguments in this issue in favour of keeping it are not relevant. In my case I would rather say "I wouldn't keep `read_csv()`" […]
Sorry if it wasn't clear for you. The main point of my previous comment is that we should make it easy for people like you to implement your own I/O connectors, and use them as if they were provided by pandas. So, in this case you would be able to maintain a third-party package providing `read_table` […]
I know that; that's why I opened this issue, because I think it is a step in the wrong direction. And if you followed the discussion, my opinion may not reflect the majority, but I am not alone either.
As a user I would appreciate a stable API; please stick to the discussion here, all the points have already been raised.
See above, please read the discussion in the current issue carefully.
If I have to do that, then pandas is useless (for me). Its IO facility and quite intuitive API was the main reason for me to use it at all. If you are going to remove functionality from the main API, I believe that will only hurt the people using pandas. And deprecation, as I understand it, is a warning that a function will be removed in future versions.
Thank you, but I have no interest in that. As a user, I want to use a module, not program around functions that have been removed for no good reason. The linked issue is an implementation issue, not API related, and therefore separate from the current issue. However you implement the details is of no concern to the user.
Edit: I realize that the above sounds like a demand without any contribution; please don't take it personally, I appreciate all the work from the maintainers. I just think that keeping an API stable will have more benefits in the long run.
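To make the third-party-package idea from the discussion above concrete, here is a rough sketch of what such a shim could look like. The module name `pandas_read_table` and everything in it are hypothetical; this is not an existing package and not the proposal in #26804:

```python
# pandas_read_table.py -- hypothetical shim keeping a read_table entry point
# outside of pandas itself (illustrative sketch only).
import pandas as pd


def read_table(filepath_or_buffer, sep="\t", **kwargs):
    """Read a general delimited file into a DataFrame.

    Thin wrapper around pandas.read_csv with a tab as the default
    separator, which is essentially what pandas.read_table provides.
    """
    return pd.read_csv(filepath_or_buffer, sep=sep, **kwargs)
```

Users would then write `from pandas_read_table import read_table`, at the cost of one extra dependency, which is exactly the trade-off being debated here.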
We decided at the sprint here to keep it for now, i.e. to remove the deprecation. PR at #27102.
I just want to note that CSV is also very often used in the more general meaning "character-separated values". Maybe the documentation of `read_csv` […]
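On the "character-separated values" reading of the name: `read_csv` already handles other separators via `sep`, for example the semicolon-delimited files common in European locales. A small sketch with made-up data:

```python
import io

import pandas as pd

# Semicolon-separated values, a common "CSV" variant in European locales,
# which also use a comma as the decimal mark.
data = "name;price\napple;1,5\nbanana;0,9\n"

df = pd.read_csv(io.StringIO(data), sep=";", decimal=",")
print(df)
```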
Problem description

`read_table()` got deprecated in favour of using `read_csv()`. Using `read_csv()` to read tab/space delimited files is counter-intuitive. According to the docs and the related issues, both share the same code, and it is not clear why the one function should be preferred over the other; that change may even break existing code. In my point of view `read_table()` is the more general function and `read_csv()` is a special case. Why would you deprecate (and then remove) the more useful function? It is already annoying to use `to_csv()` to write space/tab delimited files. And as I can see it, it comes down to two lines of code.

Proposed solution

Keep both functions as they are (un-deprecate `read_table()`), or rename the function to have a more general name like `read_txt()` (as in `numpy.genfromtxt()`) or similar.
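To make the "two lines of code" point concrete: `read_table()` behaves like `read_csv()` with a tab as the default separator, so the two calls below produce the same DataFrame. This is a simplified illustration with made-up data, not a claim about pandas' exact internals:

```python
import io

import pandas as pd

tsv = "a\tb\n1\t2\n3\t4\n"

df_table = pd.read_table(io.StringIO(tsv))        # default sep="\t"
df_csv = pd.read_csv(io.StringIO(tsv), sep="\t")  # explicit separator

assert df_table.equals(df_csv)
```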