Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow usecols to accept callable (GH14154) #14234

Merged
merged 1 commit into from
Dec 6, 2016

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Sep 16, 2016

@codecov-io
Copy link

codecov-io commented Sep 16, 2016

Current coverage is 85.28% (diff: 100%)

Merging #14234 into master will increase coverage by <.01%

@@             master     #14234   diff @@
==========================================
  Files           144        144          
  Lines         50953      50963    +10   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43452      43462    +10   
  Misses         7501       7501          
  Partials          0          0          

Powered by Codecov. Last update 4378f82...9b0d112


if usecols is not None:
if callable(usecols):
return usecols
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just in-line _evaluate_usecols here, then change _vaidate_usecols to return the usecols and set it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look but don't think that would work. The problem seems to be the names argument that's a requirement for _evaluate_usecols.

Unless I'm missing something I think I'd have to modify _validate_usecols to accept names as an argument, but it's currently being called before the names are even set (unless explicitly passed as an argument to the read function)

@jreback jreback added Enhancement IO CSV read_csv, to_csv labels Sep 16, 2016
usecols : array-like, default None
Return a subset of the columns. All elements in this array must either
usecols : array-like or callable, default None
Return a subset of the columns. If array-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid `usecols`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a mini-example of a callable

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

can you rebase, move whatsnew notes to 0.20.0, and add an example as indicated

@WillAyd
Copy link
Member Author

WillAyd commented Nov 25, 2016

Requested changes should have been made - let me know if there's anything else you would like to tweak for this enhancement

@@ -31,6 +31,7 @@ Other enhancements
^^^^^^^^^^^^^^^^^^

- ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)
- The ``usecols`` argument now accepts a callable function as a value (:issue:`14154`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usecols argument in pd.read_csv

if callable(usecols):
return set([i for i, name in enumerate(names)
if com._apply_if_callable(usecols, name)])
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just return is fine here (no else)

msg = ("The elements of 'usecols' must "
"either be all strings, all unicode, or all integers")
msg = ("'usecols' must either be all strings, all unicode, "
"all integers or callable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a callable

"either be all strings, all unicode, or all integers")

msg = ("'usecols' must either be all strings, all unicode, "
"all integers or callable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a callable

msg = ("The elements of 'usecols' must "
"either be all strings, all unicode, or all integers")
msg = ("'usecols' must either be all strings, all unicode, "
"all integers or callable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -366,3 +367,25 @@ def test_np_array_usecols(self):
expected = DataFrame([[1, 2]], columns=usecols)
result = self.read_csv(StringIO(data), usecols=usecols)
tm.assert_frame_equal(result, expected)

def test_callable_usecols(self):
s = '''AaA,bBb,CCC,ddd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment

@@ -437,7 +440,11 @@ cdef class TextReader:
# suboptimal
if usecols is not None:
self.has_usecols = 1
self.usecols = set(usecols)
if callable(usecols):
self.callable_usecols = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not use this additional variable and instead just do a callable(self.usecalls) I think its more obvious

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

@WillAyd thanks. couple more pretty minor comments. otherwise lgtm.

@WillAyd WillAyd force-pushed the callable-usecols branch 2 times, most recently from 1986920 to 5c933db Compare November 26, 2016 23:29
@WillAyd
Copy link
Member Author

WillAyd commented Nov 27, 2016

All of your requested changes should be accounted for. Let me know if anything else comes to mind.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the docstring and whatsnew, can you also update the IO docs in two places:

results in much faster parsing time and lower memory usage.
parameter would be [0, 1, 2], ['foo', 'bar', 'baz'] or lambda x: x.upper()
in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster
parsing time and lower memory usage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add before the example some explanation how the callable is used? (I suppose it is called on each individual column name and should return True or False?)

And maybe also use `` around the code example to better indicate that the "in ['AAA', 'BBB', 'DDD']" still belongs with the "lambda x: x.upper()"

def _evaluate_usecols(usecols, names):
if callable(usecols):
return set([i for i, name in enumerate(names)
if com._apply_if_callable(usecols, name)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the com._apply_if_callable is not needed anymore as you already did if callable(usecols)? (apply_if_callable does exactly the same under the hood)


df = self.read_csv(StringIO(s), usecols=lambda x:
x.upper() in ['AAA', 'BBB', 'DDD'])
tm.assert_frame_equal(df, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test that lambda x: False return an empty frame? (similar to the test_empty_usecols above)

name in self.usecols):
usecols = set()
if callable(self.usecols):
if com._apply_if_callable(self.usecols, name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, com._apply_if_callable is not needed I think

@WillAyd
Copy link
Member Author

WillAyd commented Nov 30, 2016

I noticed that the Travis-CI build failed for OS X. I didn't see anything useful in the log and checking their website it appears that the failure may have stemmed from issues on their end and not necessarily from the code. Do you have a preferred method to restart that build? Should I close / re-open the PR or do you have another way you'd like to restart that test?

screen shot 2016-11-29 at 7 01 45 pm

@jreback
Copy link
Contributor

jreback commented Nov 30, 2016

I restarted the job.

parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
results in much faster parsing time and lower memory usage.
inferred from the document header row(s). For example, a valid array-like
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. If callable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is docs, you can format this to read easier. (use an ipython block), blank line

inferred from the document header row(s). For example, a valid array-like
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. If
callable, the callable function will be evaluated against the column names,
returning names where the callable function evaluates to True. An example
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can't format this (like you can an .rst), but maybe a blank line or some bullet points?

@@ -976,17 +980,26 @@ def _is_index_col(col):
return col is not None and col is not False


def _evaluate_usecols(usecols, names):
if callable(usecols):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give a mini-doc string here

x.upper() in ['AAA', 'BBB', 'DDD'])
tm.assert_frame_equal(df, expected)

def test_callable_usecols_with_false_callable(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can put this test with the above one, just put a 1-line comment to separate them (I think they use the same input as well), to avoid duplication

@jreback
Copy link
Contributor

jreback commented Dec 4, 2016

@WillAyd just a couple of minor comments. ping when green.

@WillAyd
Copy link
Member Author

WillAyd commented Dec 6, 2016

@jreback pushed some edits to cover your latest comments. Let me know if there's anything else you'd like on this

@jreback
Copy link
Contributor

jreback commented Dec 6, 2016

lgtm. ping on green.

@jorisvandenbossche jorisvandenbossche merged commit 8cdc09e into pandas-dev:master Dec 6, 2016
@jorisvandenbossche
Copy link
Member

@WillAyd Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: allow usecols to be case insensitive
4 participants