
Adding 'n/a' to list of strings denoting missing values #16079

Merged (10 commits) on Jun 2, 2017

Conversation

@chrisgorgo
Contributor

@codecov

codecov bot commented Apr 21, 2017

Codecov Report

Merging #16079 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #16079   +/-   ##
=======================================
  Coverage   90.75%   90.75%           
=======================================
  Files         161      161           
  Lines       51095    51095           
=======================================
  Hits        46370    46370           
  Misses       4725     4725
Flag Coverage Δ
#multiple 88.59% <ø> (ø) ⬆️
#single 40.16% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/common.py 69.91% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update db419bf...eef689a.

@chrisgorgo
Contributor Author

Wow. That codecov report makes little sense.

@jreback
Contributor

jreback commented Apr 21, 2017

I suppose this is ok.

@jorisvandenbossche

@@ -521,6 +521,7 @@ Other Enhancements
- The ``usecols`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`14154`)
- The ``skiprows`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`10882`)
- The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv()`` are supported if both are passed (:issue:`6774`, :issue:`15755`)
- ``pd.read_csv()`` now treats 'n/a' strings as missing values by default (:issue:`16078`)
Contributor

use n/a

Contributor Author

Thanks for the feedback, but could you clarify what exactly would you like me to change?

Contributor

@TomAugspurger Apr 21, 2017

Put double backticks around 'n/a' so it's formatted as code in the HTML (but keep the single-quotes too I think)

@jreback added labels IO CSV (read_csv, to_csv) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) on Apr 21, 2017
@chrisgorgo
Contributor Author

Done.
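For context, the change under discussion can be sketched as follows; this assumes pandas 0.21.0 or later, where ``'n/a'`` became part of the default NA list:

```python
import io

import pandas as pd

# 'n/a' now sits alongside 'N/A', 'NA', 'NaN', etc. in the default
# na_values list, so it is parsed as missing without extra arguments.
csv = "id,score\n1,n/a\n2,3.5\n"
df = pd.read_csv(io.StringIO(csv))

print(df["score"].isna().tolist())  # [True, False]
```

The ``na_values`` and ``keep_default_na`` parameters of ``read_csv`` remain available to control this list explicitly for a particular file.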

@jreback
Contributor

jreback commented Apr 27, 2017

@jorisvandenbossche @TomAugspurger ?

lgtm.

@TomAugspurger
Contributor

I think +1... My only hesitation is that we've run into problems before with people unintentionally getting values converted to NaN, but since we already treat 'N/A' as an NA value, I think this is fine. @jorisvandenbossche thoughts?

@jorisvandenbossche
Member

Yeah, if we were starting again, certainly +1 on adding it to the default list of strings parsed as missing. But if someone has CSV files with actual valid 'n/a' values, we will certainly break their code ... It's difficult to assess, however, how often that would occur.
That said, if you are +1, I won't object :-)
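For the case raised here (files where 'n/a' is a valid value), ``read_csv`` already offers an opt-out; a minimal sketch using the real ``keep_default_na`` parameter:

```python
import io

import pandas as pd

csv = "status\nok\nn/a\n"

# keep_default_na=False disables the built-in NA string list entirely,
# so a literal 'n/a' field is kept as the string "n/a".
df = pd.read_csv(io.StringIO(csv), keep_default_na=False)

print(df["status"].tolist())  # ['ok', 'n/a']
```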

@jorisvandenbossche
Member

Small comment on the PR: I would list this in the backwards incompatible API changes section, not enhancements

@TomAugspurger
Contributor

Since this is technically an API change, I think we should wait till 0.20.1. Sorry for not getting this in before the RC @chrisfilo

@jorisvandenbossche
Member

Since this is technically an API change, I think we should wait till 0.20.1.

Small correction, I think this should be 0.21.0, as 0.20.1 shouldn't contain API changes.

@chrisgorgo
Contributor Author

So how should we proceed? Is there a separate branch for 0.21.0 I should send this PR to?

@TomAugspurger
Contributor

So how should we proceed? Is there a separate branch for 0.21.0 I should send this PR to?

It will still be merged into master, just not until after we've finished the 0.20.0 release (this week sometime). Once that's out, I'll add a whatsnew/0.21.0 and ask you to move your release note over there.

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone May 2, 2017
@gfyoung
Member

gfyoung commented May 3, 2017

Also, need more tests! Try passing in data containing your new na_value.
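Such a test might look like the sketch below; this is an illustration, not the exact test added in the PR's ``pandas/tests/io/parser/na_values.py``:

```python
import io

import numpy as np
import pandas as pd
import pandas.testing as tm  # pandas.util.testing in the 0.20-era codebase


def test_n_slash_a_treated_as_missing():
    # 'n/a' appears in both columns and should round-trip to NaN.
    data = "a,b\nn/a,1\n2,n/a\n"
    result = pd.read_csv(io.StringIO(data))
    expected = pd.DataFrame({"a": [np.nan, 2.0], "b": [1.0, np.nan]})
    tm.assert_frame_equal(result, expected)
```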

@jreback jreback modified the milestone: 0.21.0 May 5, 2017
@jreback
Contributor

jreback commented May 19, 2017

can you rebase

@chrisgorgo
Contributor Author

@gfyoung
Member

gfyoung commented May 19, 2017

Awesome! @jreback @TomAugspurger might it be prudent to start sharing the na_values list somehow, so that we don't have to do maintenance in two different files? Not blocking this PR, but just a thought.

@jreback
Contributor

jreback commented May 19, 2017

Awesome! @jreback @TomAugspurger might it be prudent to start sharing the na_values list somehow, so that we don't have to do maintenance in two different files? Not blocking this PR, but just a thought.

yeah that is certainly possible, if you'd like to do an issue/PR
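A shared list could be as simple as one module-level constant imported by both parsers. The module and names below are hypothetical, not pandas' actual layout; the string set mirrors the documented defaults:

```python
# Hypothetical shared module, e.g. na_constants.py.
# One frozen set maintained in a single place; both the C and the
# pure-Python parser would import it instead of keeping private copies.
DEFAULT_NA_VALUES = frozenset([
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "N/A", "n/a", "NA", "NULL", "NaN", "nan", "null",
])


def is_default_na(value: str) -> bool:
    """Return True if `value` would be treated as missing by default."""
    return value in DEFAULT_NA_VALUES


print(is_default_na("n/a"))  # True
```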

@chrisgorgo
Contributor Author

I'm getting the following test error, but I don't see how it could be related to the changes I made. Please advise:

=================================== FAILURES ===================================
_ TestCommonIOCapabilities.test_write_fspath_all[to_hdf-writer_kwargs3-tables] _

self = <pandas.tests.io.test_common.TestCommonIOCapabilities object at 0x7ff8c30fdc88>
writer_name = 'to_hdf', writer_kwargs = {'key': 'bar', 'mode': 'w'}
module = 'tables'

    @pytest.mark.parametrize('writer_name, writer_kwargs, module', [
        ('to_csv', {}, 'os'),
        ('to_excel', {'engine': 'xlwt'}, 'xlwt'),
        ('to_feather', {}, 'feather'),
        ('to_hdf', {'key': 'bar', 'mode': 'w'}, 'tables'),
        ('to_html', {}, 'os'),
        ('to_json', {}, 'os'),
        ('to_latex', {}, 'os'),
        ('to_msgpack', {}, 'os'),
        ('to_pickle', {}, 'os'),
        ('to_stata', {}, 'os'),
    ])
    def test_write_fspath_all(self, writer_name, writer_kwargs, module):
        p1 = tm.ensure_clean('string')
        p2 = tm.ensure_clean('fspath')
        df = pd.DataFrame({"A": [1, 2]})
    
        with p1 as string, p2 as fspath:
            pytest.importorskip(module)
            mypath = CustomFSPath(fspath)
            writer = getattr(df, writer_name)
    
            writer(string, **writer_kwargs)
            with open(string, 'rb') as f:
                expected = f.read()
    
            writer(mypath, **writer_kwargs)
            with open(fspath, 'rb') as f:
                result = f.read()
    
>           assert result == expected
E           AssertionError: assert b'\x89HDF\r\n...0\x00\x00\x00' == b'\x89HDF\r\n\...0\x00\x00\x00'
E             At index 2620 diff: 164 != 163
E             Use -v to get the full diff

@gfyoung
Member

gfyoung commented May 20, 2017

Do you get this failure consistently? And with which parameters is it failing? (I presume to_csv but want to double check)

@chrisgorgo
Contributor Author

Actually it's not to_csv but to_hdf. If you could trigger a rebuild on Circle, we can check whether this is failing consistently (sorry, I don't have permissions to do it): https://circleci.com/gh/pandas-dev/pandas/2031#tests/containers/2

@gfyoung
Member

gfyoung commented May 20, 2017

@chrisfilo : Neither do I (@jreback ?) 😢 However, I have been noticing lately that CircleCI has been acting up on us and failing in random places like this. I'm beginning to wonder if CircleCI is useful for our testing purposes compared to Travis and Appveyor.

@chrisgorgo
Contributor Author

ping

@jreback
Contributor

jreback commented Jun 1, 2017

can you rebase this

@@ -49,6 +49,7 @@ Backwards incompatible API changes

- Accessing a non-existent attribute on a closed :class:`HDFStore` will now
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- ``pd.read_csv()`` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
Contributor

can you move this right below the other issue (where null is added)

Contributor Author

I could, but @jorisvandenbossche made exactly the opposite request: #16079 (comment)

Contributor

ahh I see. then I will move the other one here :>

@jreback jreback added this to the 0.21.0 milestone Jun 1, 2017
@chrisgorgo
Contributor Author

Here you go.

PS Hint for keeping contributors happy - merge conflicting PRs in chronological order so people will not be punished with resolving conflicts for no apparent reason.

@jreback
Contributor

jreback commented Jun 1, 2017

PS Hint for keeping contributors happy - merge conflicting PRs in chronological order so people will not be punished with resolving conflicts for no apparent reason.

hah!

yeah we do leave space in the whatsnew so that we won't have conflicts to begin with.

We have lots of PRs; it is an iterative process. I am a bit biased, as I look for recently updated ones that are green. Then I (or others) may make comments, causing another round of changes. The more responsive people are, the quicker things get merged. And of course, the more they rebase, the fewer conflicts occur.

lgtm. ping on green.

@jreback
Contributor

jreback commented Jun 1, 2017

PS Hint for keeping contributors happy - merge conflicting PRs in chronological order so people will not be punished with resolving conflicts for no apparent reason.

this actually is not fair. More fair is merging responsive PRs: they are more likely to be ready. We have long-running PRs which need rebasing, which proactive authors do.

Sure, when we have a queue and/or are trying to get a release out, things get a bit jumbled...

@chrisgorgo
Contributor Author

I believe in this case the more considerate thing to do would've been to read the comments, hit the rebuild button on circle, wait for green and merge this first. Recency of the last comment is a poor proxy of responsiveness if the contributor is blocked on faulty CI. This incredibly simple PR was ready to merge 12 days ago.

Ultimately it's the faulty CI (or over-reliance on it) that is causing the frustration. I understand that there is a huge number of PRs in this project (truly remarkable), but I'm trying to give you the perspective of a project newcomer and a chance to improve your practices. The overall goal is to make the community better and more efficient.

@jreback
Contributor

jreback commented Jun 1, 2017

I believe in this case the more considerate thing to do would've been to read the comments, hit the rebuild button on circle, wait for green and merge this first. Recency of the last comment is a poor proxy of responsiveness if the contributor is blocked on faulty CI. This incredibly simple PR was ready to merge 12 days ago.

@chrisfilo well, you are assuming we are constantly searching for PRs that are ready to merge :> This is not the case; rather, we get a ping and then react, or I scan the ready-to-go PRs. I do prioritize all-green PRs, as those are most likely to be ready. Sure, a CI failure happens, and certainly if pinged we would restart it if possible, but if it didn't get restarted then it's a red X, and the PR appears non-responsive. Of course the author can always do a `git commit -C HEAD --all --amend` and push to force a rebuild :>

@jreback
Contributor

jreback commented Jun 1, 2017

pandas/tests/io/parser/na_values.py:73:80: E501 line too long (83 > 79 characters)

flake issue

@chrisgorgo
Contributor Author

You were pinged.
Twice.

But you are right that I could've just done `git commit -C HEAD --all --amend`.

@jreback jreback merged commit 5f312da into pandas-dev:master Jun 2, 2017
@jreback
Contributor

jreback commented Jun 2, 2017

thanks!

Labels
IO CSV (read_csv, to_csv), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
Development

Successfully merging this pull request may close these issues.

'n/a' is not among default list of missing value strings
5 participants