Adding 'n/a' to list of strings denoting missing values #16079
Codecov Report
@@ Coverage Diff @@
## master #16079 +/- ##
=======================================
Coverage 90.75% 90.75%
=======================================
Files 161 161
Lines 51095 51095
=======================================
Hits 46370 46370
Misses 4725 4725
Continue to review full report at Codecov.
|
Wow. That codecov report makes little sense. |
I suppose this is ok. |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -521,6 +521,7 @@ Other Enhancements | |||
- The ``usecols`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`14154`) | |||
- The ``skiprows`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`10882`) | |||
- The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv()`` are supported if both are passed (:issue:`6774`, :issue:`15755`) | |||
- ``pd.read_csv()`` now treats 'n/a' strings as missing values by default (:issue:`16078`) |
use n/a
Thanks for the feedback, but could you clarify what exactly you would like me to change?
Put double backticks around 'n/a' so it's formatted as code in the HTML (but keep the single-quotes too I think)
Done. |
@jorisvandenbossche @TomAugspurger ? lgtm. |
I think +1... My only hesitation is that we've run into problems before with people unintentionally getting stuff converted to NaN, but since we already treat N/A as a NaN value, I think this is fine. @jorisvandenbossche thoughts? |
Yeah, if we would start again, certainly +1 on adding it to the default list to parse as missing. But if someone has csv files with actual valid 'n/a' values, we will for sure break his/her code ... Difficult to assess however if that would occur. |
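For readers weighing the compatibility concern above, here is a minimal sketch of the behaviour being discussed and of one way to opt out. It assumes a pandas version that already includes this change (0.21.0 or later, going by the thread); the column names and sample data are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample data: the "status" column contains a literal 'n/a' string.
csv_text = "id,status\n1,n/a\n2,ok\n"

# Default parsing: with this change, 'n/a' is on the default missing-value
# list, so the first status is read as NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# Opting out: keep_default_na=False disables the default list, so 'n/a'
# survives as an ordinary string.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["status"].isna().tolist())
print(raw_df["status"].tolist())
```

Users who rely on literal 'n/a' values can also pass `keep_default_na=False` together with an explicit `na_values` list to choose exactly which markers count as missing.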
Small comment on the PR: I would list this in the backwards incompatible API changes section, not enhancements |
Since this is technically an API change, I think we should wait till 0.20.1. Sorry for not getting this in before the RC @chrisfilo |
Small correction, I think this should be 0.21.0, as 0.20.1 shouldn't contain API changes. |
So how should we proceed? Is there a separate branch for 0.21.0 I should send this PR to? |
It will still be merged into master, just not until after we've finished the 0.20.0 release (this week sometime). Once that's out, I'll add a |
Also, need more tests! Try passing in data containing your new |
can you rebase |
Done. @gfyoung the test you requested is here: https://github.com/pandas-dev/pandas/pull/16079/files#diff-6e435422c67fa1384140f92110fb69a7R73 |
Awesome! @jreback @TomAugspurger might it be prudent to start sharing the na_values list somehow so that we don't have to do maintenance in two different files? Not blocking this PR but just a thought. |
yeah that is certainly possible, if you'd like to do an issue/PR |
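The idea of sharing the list could look roughly like the sketch below. This is not how pandas is organised; every name here is hypothetical, and the set holds only an illustrative subset of markers mentioned in this thread. The point is only that both consumers read one constant, so a new marker such as 'n/a' is added in a single place.

```python
# Hypothetical sketch: none of these names come from the pandas code base.
# Illustrative subset of missing-value markers, defined exactly once.
SHARED_NA_VALUES = frozenset({"", "N/A", "n/a", "NA", "NaN", "null"})


def parse_cell(token, na_values=SHARED_NA_VALUES):
    """Stand-in for one parser: map a missing-value marker to None."""
    return None if token in na_values else token


def parse_row(line, sep=",", na_values=SHARED_NA_VALUES):
    """Stand-in for a second parser that reuses the same shared set."""
    return [None if tok in na_values else tok for tok in line.split(sep)]


print(parse_cell("n/a"))
print(parse_row("1,n/a,ok"))
```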
I'm getting the following test error, but I don't see how it could be related to the changes I made. Please advise:
|
Do you get this failure consistently? And with which parameters is it failing? (I presume |
Actually it's not |
@chrisfilo : Neither do I (@jreback ?) 😢 However, I have been noticing lately that |
ping |
can you rebase this |
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -49,6 +49,7 @@ Backwards incompatible API changes | |||
|
|||
- Accessing a non-existent attribute on a closed :class:`HDFStore` will now | |||
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`) | |||
- ``pd.read_csv()`` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`) |
can you move this right below the other issue (where null is added)
I could, but @jorisvandenbossche made exactly the opposite request: #16079 (comment)
ahh I see. then I will move the other one here :>
Here you go. PS Hint for keeping contributors happy - merge conflicting PRs in chronological order so people will not be punished with resolving conflicts for no apparent reason. |
hah! yeah we do leave space in the whatsnew so that we won't have conflicts to begin with. we have lots of PR's. It is an iterative process. I am a bit biased as I look for recently updated ones that are green. Then I (or others) may make comments, causing another round of changes. The more responsive people are, the quicker things get merged. And of course the more they rebase the less conflicts occur. :. lgtm. ping on green. |
this actually is not fair. more fair is merging responsive PR's. They are more likely to be ready. We have long running PR's which need rebasing, which pro-active authors do. Sure when we have a queue and/or are trying to get out a release things get a bit jumbled... |
I believe in this case the more considerate thing to do would've been to read the comments, hit the rebuild button on circle, wait for green and merge this first. Recency of the last comment is a poor proxy of responsiveness if the contributor is blocked on faulty CI. This incredibly simple PR was ready to merge 12 days ago. Ultimately it is the faulty CI (or over-reliance on it) that is causing the frustrations. I understand that there is a shit tonne of PRs in this project (truly remarkable), but I'm trying to give you a unique perspective of a project newcomer and an ability to improve your practices. Overall goal is to make the community better and more efficient. |
@chrisfilo well you are assuming we are constantly searching for PR's that are ready to merge :> this is not the case. rather we DO get a ping and then react, or I do scan things on ready to go PR's. I DO prioritize all green PR's as those are most likely to be ready. Sure a CI failure happens, and certainly if pinged we would restart if possible, but if it didn't get restarted then it's a red X, and it appears like the PR is non-responsive. Of course the writer can always do a |
flake issue |
thanks! |
git diff upstream/master --name-only -- '*.py' | flake8 --diff