
Adding 'n/a' to list of strings denoting missing values #16079

Merged (10 commits) on Jun 2, 2017

Conversation

@chrisgorgo
Contributor

@codecov

codecov bot commented Apr 21, 2017

Codecov Report

Merging #16079 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #16079   +/-   ##
=======================================
  Coverage   90.75%   90.75%           
=======================================
  Files         161      161           
  Lines       51095    51095           
=======================================
  Hits        46370    46370           
  Misses       4725     4725
Flag Coverage Δ
#multiple 88.59% <ø> (ø) ⬆️
#single 40.16% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/common.py 69.91% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update db419bf...eef689a.

@chrisgorgo
Contributor Author

Wow. That codecov report makes little sense.

@jreback
Contributor

jreback commented Apr 21, 2017

I suppose this is ok.

@jorisvandenbossche

@@ -521,6 +521,7 @@ Other Enhancements
- The ``usecols`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`14154`)
- The ``skiprows`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`10882`)
- The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv()`` are supported if both are passed (:issue:`6774`, :issue:`15755`)
- ``pd.read_csv()`` now treats 'n/a' strings as missing values by default (:issue:`16078`)
Contributor

use n/a

Contributor Author

Thanks for the feedback, but could you clarify what exactly would you like me to change?

Contributor

@TomAugspurger Apr 21, 2017

Put double backticks around 'n/a' so it's formatted as code in the HTML (but keep the single-quotes too I think)

@jreback added labels IO CSV (read_csv, to_csv) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) on Apr 21, 2017
@chrisgorgo
Contributor Author

Done.
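For context, the change under discussion can be sketched as follows; this assumes pandas 0.21.0 or later, where ``'n/a'`` became part of the default NA list:

```python
import io

import pandas as pd

# 'n/a' now sits alongside 'N/A', 'NA', 'NaN', etc. in the default
# na_values list, so it is parsed as missing without extra arguments.
csv = "id,score\n1,n/a\n2,3.5\n"
df = pd.read_csv(io.StringIO(csv))

print(df["score"].isna().tolist())  # [True, False]
```

The ``na_values`` and ``keep_default_na`` parameters of ``read_csv`` remain available to control this list explicitly for a particular file.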

@jreback
Contributor

jreback commented Apr 27, 2017

@jorisvandenbossche @TomAugspurger ?

lgtm.

@TomAugspurger
Contributor

I think +1... My only hesitation is that we've run into problems before with people unintentionally getting values converted to NaN, but since we already treat 'N/A' as an NA value, I think this is fine. @jorisvandenbossche thoughts?

@jorisvandenbossche
Member

Yeah, if we were starting again, certainly +1 on adding it to the default list of strings parsed as missing. But if someone has CSV files with actual valid 'n/a' values, we will certainly break their code ... It's difficult to assess, however, how often that would occur.
That said, if you are +1, I won't object :-)
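For the case raised here (files where 'n/a' is a valid value), ``read_csv`` already offers an opt-out; a minimal sketch using the real ``keep_default_na`` parameter:

```python
import io

import pandas as pd

csv = "status\nok\nn/a\n"

# keep_default_na=False disables the built-in NA string list entirely,
# so a literal 'n/a' field is kept as the string "n/a".
df = pd.read_csv(io.StringIO(csv), keep_default_na=False)

print(df["status"].tolist())  # ['ok', 'n/a']
```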

@jorisvandenbossche
Member

Small comment on the PR: I would list this in the backwards incompatible API changes section, not enhancements

@TomAugspurger
Contributor

Since this is technically an API change, I think we should wait till 0.20.1. Sorry for not getting this in before the RC @chrisfilo

@jorisvandenbossche
Member

Since this is technically an API change, I think we should wait till 0.20.1.

Small correction, I think this should be 0.21.0, as 0.20.1 shouldn't contain API changes.

@chrisgorgo
Contributor Author

So how should we proceed? Is there a separate branch for 0.21.0 I should send this PR to?

@TomAugspurger
Contributor

So how should we proceed? Is there a separate branch for 0.21.0 I should send this PR to?

It will still be merged into master, just not until after we've finished the 0.20.0 release (this week sometime). Once that's out, I'll add a whatsnew/0.21.0 and ask you to move your release note over there.

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone May 2, 2017
@gfyoung
Member

gfyoung commented May 3, 2017

Also, need more tests! Try passing in data containing your new na_value.
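Such a test might look like the sketch below; this is an illustration, not the exact test added in the PR's ``pandas/tests/io/parser/na_values.py``:

```python
import io

import numpy as np
import pandas as pd
import pandas.testing as tm  # pandas.util.testing in the 0.20-era codebase


def test_n_slash_a_treated_as_missing():
    # 'n/a' appears in both columns and should round-trip to NaN.
    data = "a,b\nn/a,1\n2,n/a\n"
    result = pd.read_csv(io.StringIO(data))
    expected = pd.DataFrame({"a": [np.nan, 2.0], "b": [1.0, np.nan]})
    tm.assert_frame_equal(result, expected)
```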

@jreback jreback modified the milestone: 0.21.0 May 5, 2017
@jreback
Contributor

jreback commented May 19, 2017

can you rebase

@chrisgorgo
Contributor Author

@gfyoung
Member

gfyoung commented May 19, 2017

Awesome! @jreback @TomAugspurger might it be prudent to start sharing the na_values list somehow, so that we don't have to do maintenance in two different files? Not blocking this PR, but just a thought.

@jreback
Contributor

jreback commented May 19, 2017

Awesome! @jreback @TomAugspurger might it be prudent to start sharing the na_values list somehow, so that we don't have to do maintenance in two different files? Not blocking this PR, but just a thought.

yeah that is certainly possible, if you'd like to do an issue/PR
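A shared list could be as simple as one module-level constant imported by both parsers. The module and names below are hypothetical, not pandas' actual layout; the string set mirrors the documented defaults:

```python
# Hypothetical shared module, e.g. na_constants.py.
# One frozen set maintained in a single place; both the C and the
# pure-Python parser would import it instead of keeping private copies.
DEFAULT_NA_VALUES = frozenset([
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "N/A", "n/a", "NA", "NULL", "NaN", "nan", "null",
])


def is_default_na(value: str) -> bool:
    """Return True if `value` would be treated as missing by default."""
    return value in DEFAULT_NA_VALUES


print(is_default_na("n/a"))  # True
```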

@chrisgorgo
Contributor Author

I'm getting the following test error, but I don't see how it could be related to the changes I made. Please advise:

=================================== FAILURES ===================================
_ TestCommonIOCapabilities.test_write_fspath_all[to_hdf-writer_kwargs3-tables] _

self = <pandas.tests.io.test_common.TestCommonIOCapabilities object at 0x7ff8c30fdc88>
writer_name = 'to_hdf', writer_kwargs = {'key': 'bar', 'mode': 'w'}
module = 'tables'

    @pytest.mark.parametrize('writer_name, writer_kwargs, module', [
        ('to_csv', {}, 'os'),
        ('to_excel', {'engine': 'xlwt'}, 'xlwt'),
        ('to_feather', {}, 'feather'),
        ('to_hdf', {'key': 'bar', 'mode': 'w'}, 'tables'),
        ('to_html', {}, 'os'),
        ('to_json', {}, 'os'),
        ('to_latex', {}, 'os'),
        ('to_msgpack', {}, 'os'),
        ('to_pickle', {}, 'os'),
        ('to_stata', {}, 'os'),
    ])
    def test_write_fspath_all(self, writer_name, writer_kwargs, module):
        p1 = tm.ensure_clean('string')
        p2 = tm.ensure_clean('fspath')
        df = pd.DataFrame({"A": [1, 2]})
    
        with p1 as string, p2 as fspath:
            pytest.importorskip(module)
            mypath = CustomFSPath(fspath)
            writer = getattr(df, writer_name)
    
            writer(string, **writer_kwargs)
            with open(string, 'rb') as f:
                expected = f.read()
    
            writer(mypath, **writer_kwargs)
            with open(fspath, 'rb') as f:
                result = f.read()
    
>           assert result == expected
E           AssertionError: assert b'\x89HDF\r\n...0\x00\x00\x00' == b'\x89HDF\r\n\...0\x00\x00\x00'
E             At index 2620 diff: 164 != 163
E             Use -v to get the full diff

@gfyoung
Member

gfyoung commented May 20, 2017

Do you get this failure consistently? And with which parameters is it failing? (I presume to_csv but want to double check)

@chrisgorgo
Contributor Author

Actually it's not to_csv but to_hdf. If you could trigger a rebuild on Circle, we can check whether this is failing consistently (sorry, I don't have permissions to do it): https://circleci.com/gh/pandas-dev/pandas/2031#tests/containers/2

@gfyoung
Member

gfyoung commented May 20, 2017

@chrisfilo : Neither do I (@jreback ?) 😢 However, I have been noticing lately that CircleCI has been acting up on us and failing in random places like this. I'm beginning to wonder if CircleCI is useful for our testing purposes compared to Travis and Appveyor.

@chrisgorgo
Contributor Author

ping

@jreback
Contributor

jreback commented Jun 1, 2017

can you rebase this

@@ -49,6 +49,7 @@ Backwards incompatible API changes

- Accessing a non-existent attribute on a closed :class:`HDFStore` will now
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- ``pd.read_csv()`` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
Contributor

can you move this right below the other issue (where null is added)

Contributor Author

I could, but @jorisvandenbossche made exactly the opposite request: #16079 (comment)

Contributor

ahh I see. then I will move the other one here :>

@jreback jreback added this to the 0.21.0 milestone Jun 1, 2017
@chrisgorgo
Contributor Author

Here you go.

PS Hint for keeping contributors happy - merge conflicting PRs in chronological order so people will not be punished with resolving conflicts for no apparent reason.

@jreback
Contributor

jreback commented Jun 1, 2017

PS Hint for keeping contributors happy - merge conflicting PRs in chronological order so people will not be punished with resolving conflicts for no apparent reason.

hah!

yeah we do leave space in the whatsnew so that we won't have conflicts to begin with.

We have lots of PRs; it is an iterative process. I am a bit biased, as I look for recently updated ones that are green. Then I (or others) may make comments, causing another round of changes. The more responsive people are, the quicker things get merged. And of course, the more they rebase, the fewer conflicts occur.

lgtm. ping on green.

@jreback
Contributor

jreback commented Jun 1, 2017

PS Hint for keeping contributors happy - merge conflicting PRs in chronological order so people will not be punished with resolving conflicts for no apparent reason.

this actually is not fair. More fair is merging responsive PRs: they are more likely to be ready. We have long-running PRs which need rebasing, which proactive authors do.

Sure, when we have a queue and/or are trying to get a release out, things get a bit jumbled...

@chrisgorgo
Contributor Author

I believe in this case the more considerate thing to do would've been to read the comments, hit the rebuild button on circle, wait for green and merge this first. Recency of the last comment is a poor proxy of responsiveness if the contributor is blocked on faulty CI. This incredibly simple PR was ready to merge 12 days ago.

Ultimately it's the faulty CI (or over-reliance on it) that is causing the frustration. I understand that there is a huge number of PRs in this project (truly remarkable), but I'm trying to give you the perspective of a project newcomer and a chance to improve your practices. The overall goal is to make the community better and more efficient.

@jreback
Contributor

jreback commented Jun 1, 2017

I believe in this case the more considerate thing to do would've been to read the comments, hit the rebuild button on circle, wait for green and merge this first. Recency of the last comment is a poor proxy of responsiveness if the contributor is blocked on faulty CI. This incredibly simple PR was ready to merge 12 days ago.

@chrisfilo well, you are assuming we are constantly searching for PRs that are ready to merge :> This is not the case; rather, we get a ping and then react, or I scan the ready-to-go PRs. I do prioritize all-green PRs, as those are most likely to be ready. Sure, a CI failure happens, and certainly if pinged we would restart it if possible, but if it didn't get restarted then it's a red X, and the PR appears non-responsive. Of course the author can always do a `git commit -C HEAD --all --amend` and push to force a rebuild :>

@jreback
Contributor

jreback commented Jun 1, 2017

pandas/tests/io/parser/na_values.py:73:80: E501 line too long (83 > 79 characters)

flake issue

@chrisgorgo
Contributor Author

You were pinged.
Twice.

But you are right that I could've just done `git commit -C HEAD --all --amend`.

@jreback jreback merged commit 5f312da into pandas-dev:master Jun 2, 2017
@jreback
Contributor

jreback commented Jun 2, 2017

thanks!

Labels
IO CSV (read_csv, to_csv), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
Development

Successfully merging this pull request may close these issues.

'n/a' is not among default list of missing value strings
5 participants