-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Allow usecols to accept callable (GH14154) #14234
Conversation
Current coverage is 85.28% (diff: 100%)@@ master #14234 diff @@
==========================================
Files 144 144
Lines 50953 50963 +10
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43452 43462 +10
Misses 7501 7501
Partials 0 0
|
|
||
if usecols is not None: | ||
if callable(usecols): | ||
return usecols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just in-line _evaluate_usecols
here, then change _vaidate_usecols
to return the usecols and set it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look but don't think that would work. The problem seems to be the names argument that's a requirement for _evaluate_usecols
.
Unless I'm missing something I think I'd have to modify _validate_usecols
to accept names as an argument, but it's currently being called before the names are even set (unless explicitly passed as an argument to the read
function)
usecols : array-like, default None | ||
Return a subset of the columns. All elements in this array must either | ||
usecols : array-like or callable, default None | ||
Return a subset of the columns. If array-like, all elements must either | ||
be positional (i.e. integer indices into the document columns) or strings | ||
that correspond to column names provided either by the user in `names` or | ||
inferred from the document header row(s). For example, a valid `usecols` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a mini-example of a callable
can you rebase, move whatsnew notes to 0.20.0, and add an example as indicated |
5f17d1d
to
ddbdb7a
Compare
Requested changes should have been made - let me know if there's anything else you would like to tweak for this enhancement |
@@ -31,6 +31,7 @@ Other enhancements | |||
^^^^^^^^^^^^^^^^^^ | |||
|
|||
- ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`) | |||
- The ``usecols`` argument now accepts a callable function as a value (:issue:`14154`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usecols
argument in pd.read_csv
if callable(usecols): | ||
return set([i for i, name in enumerate(names) | ||
if com._apply_if_callable(usecols, name)]) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just return is fine here (no else)
msg = ("The elements of 'usecols' must " | ||
"either be all strings, all unicode, or all integers") | ||
msg = ("'usecols' must either be all strings, all unicode, " | ||
"all integers or callable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a callable
"either be all strings, all unicode, or all integers") | ||
|
||
msg = ("'usecols' must either be all strings, all unicode, " | ||
"all integers or callable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a callable
msg = ("The elements of 'usecols' must " | ||
"either be all strings, all unicode, or all integers") | ||
msg = ("'usecols' must either be all strings, all unicode, " | ||
"all integers or callable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@@ -366,3 +367,25 @@ def test_np_array_usecols(self): | |||
expected = DataFrame([[1, 2]], columns=usecols) | |||
result = self.read_csv(StringIO(data), usecols=usecols) | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
def test_callable_usecols(self): | |||
s = '''AaA,bBb,CCC,ddd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number as a comment
@@ -437,7 +440,11 @@ cdef class TextReader: | |||
# suboptimal | |||
if usecols is not None: | |||
self.has_usecols = 1 | |||
self.usecols = set(usecols) | |||
if callable(usecols): | |||
self.callable_usecols = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not use this additional variable and instead just do a callable(self.usecalls)
I think its more obvious
@WillAyd thanks. couple more pretty minor comments. otherwise lgtm. |
1986920
to
5c933db
Compare
All of your requested changes should be accounted for. Let me know if anything else comes to mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the docstring and whatsnew, can you also update the IO docs in two places:
- overview of keywords: http://pandas.pydata.org/pandas-docs/stable/io.html#column-and-index-locations-and-names
- section on usecols: http://pandas.pydata.org/pandas-docs/stable/io.html#filtering-columns-usecols (I would add an example of a callable there)
results in much faster parsing time and lower memory usage. | ||
parameter would be [0, 1, 2], ['foo', 'bar', 'baz'] or lambda x: x.upper() | ||
in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster | ||
parsing time and lower memory usage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add before the example some explanation how the callable is used? (I suppose it is called on each individual column name and should return True or False?)
And maybe also use ``
around the code example to better indicate that the "in ['AAA', 'BBB', 'DDD']" still belongs with the "lambda x: x.upper()"
def _evaluate_usecols(usecols, names): | ||
if callable(usecols): | ||
return set([i for i, name in enumerate(names) | ||
if com._apply_if_callable(usecols, name)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the com._apply_if_callable
is not needed anymore as you already did if callable(usecols)
? (apply_if_callable does exactly the same under the hood)
|
||
df = self.read_csv(StringIO(s), usecols=lambda x: | ||
x.upper() in ['AAA', 'BBB', 'DDD']) | ||
tm.assert_frame_equal(df, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test that lambda x: False
return an empty frame? (similar to the test_empty_usecols above)
name in self.usecols): | ||
usecols = set() | ||
if callable(self.usecols): | ||
if com._apply_if_callable(self.usecols, name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, com._apply_if_callable
is not needed I think
5c933db
to
a1a05ca
Compare
I noticed that the Travis-CI build failed for OS X. I didn't see anything useful in the log and checking their website it appears that the failure may have stemmed from issues on their end and not necessarily from the code. Do you have a preferred method to restart that build? Should I close / re-open the PR or do you have another way you'd like to restart that test? |
I restarted the job. |
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter | ||
results in much faster parsing time and lower memory usage. | ||
inferred from the document header row(s). For example, a valid array-like | ||
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. If callable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is docs, you can format this to read easier. (use an ipython block), blank line
inferred from the document header row(s). For example, a valid array-like | ||
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. If | ||
callable, the callable function will be evaluated against the column names, | ||
returning names where the callable function evaluates to True. An example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't format this (like you can an .rst), but maybe a blank line or some bullet points?
@@ -976,17 +980,26 @@ def _is_index_col(col): | |||
return col is not None and col is not False | |||
|
|||
|
|||
def _evaluate_usecols(usecols, names): | |||
if callable(usecols): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give a mini-doc string here
x.upper() in ['AAA', 'BBB', 'DDD']) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
def test_callable_usecols_with_false_callable(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can put this test with the above one, just put a 1-line comment to separate them (I think they use the same input as well), to avoid duplication
@WillAyd just a couple of minor comments. ping when green. |
a1a05ca
to
9b0d112
Compare
@jreback pushed some edits to cover your latest comments. Let me know if there's anything else you'd like on this |
lgtm. ping on green. |
@WillAyd Thanks! |
git diff upstream/master | flake8 --diff