-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Better error message if usecols doesn't match columns #17310
ENH: Better error message if usecols doesn't match columns #17310
Conversation
GH17301: Improving the error message given when usecols doesn't match with the columns provided. Signed-off-by: Aaron Critchley <aaron.critchley@gmail.com>
pandas/io/parsers.py
Outdated
|
||
if len(self.names) > len(usecols): | ||
self.names = [n for i, n in enumerate(self.names) | ||
if (i in usecols or n in usecols)] | ||
|
||
if len(self.names) < len(usecols): | ||
raise ValueError("Usecols do not match names.") | ||
missing = [c for c in usecols if c not in self.orig_names] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, it should be checking for membership in self.names
.
pandas/tests/io/parser/usecols.py
Outdated
@@ -481,7 +481,7 @@ def test_raise_on_usecols_names_mismatch(self): | |||
data = 'a,b,c,d\n1,2,3,4\n5,6,7,8' | |||
|
|||
if self.engine == 'c': | |||
msg = 'Usecols do not match names' | |||
msg = "do not match columns, columns expected but not found" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to have a stronger check for which columns are listed (both in the self.orig_names
and self.names
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is this non-helpful error an issue when you pass in engine='python'
? We would like to make sure that both engines are equally expressive about these types of errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on the test, it'll require a minor reformat of the structure of the test - if that's not an issue happy to go ahead (unless wildcarding the regex for the arguments is sufficient).
Will see if it's also an issue on the Python engine and if so will implement a fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer that you explicitly put in the string which elements are missing. If that's what you're going for, then go for it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung I've just tested with the Python engine, and the error is:
ValueError: ' column3' is not in list
Would we want the error to be consistent between both engines? If yes, would we prefer the proposed or existing format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your error message is better. Let's use it for both engines.
pandas/io/parsers.py
Outdated
missing = [c for c in usecols if c not in self.orig_names] | ||
raise ValueError( | ||
"Usecols do not match columns, " | ||
"columns expected but not found: %s" % missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use .format(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, should we be using .format everywhere?
If yes there's a lot of places where we don't currently do that, I'd be happy to start changing some of those over too (in a seperate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#16130 (it's being worked on in chunks; feel free to pitch in)
pandas/io/parsers.py
Outdated
raise ValueError( | ||
"Usecols do not match columns, " | ||
"columns expected but not found: %s" % missing | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Codecov Report
@@ Coverage Diff @@
## master #17310 +/- ##
==========================================
- Coverage 91.01% 90.99% -0.02%
==========================================
Files 162 162
Lines 49558 49560 +2
==========================================
- Hits 45105 45097 -8
- Misses 4453 4463 +10
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17310 +/- ##
==========================================
- Coverage 91.46% 91.44% -0.02%
==========================================
Files 157 157
Lines 51447 51455 +8
==========================================
- Hits 47055 47053 -2
- Misses 4392 4402 +10
Continue to review full report at Codecov.
|
pandas/io/parsers.py
Outdated
missing = [c for c in usecols if c not in self.orig_names] | ||
raise ValueError( | ||
"Usecols do not match columns, " | ||
"columns expected but not found: {}".format(missing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use keyword arguments in the string formatting. Same for your other error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I see a lot of duplicate code here. Let's abstract into a method (you can just create a private method outside of the class). That will make your life easier.
pandas/io/parsers.py
Outdated
@@ -2442,6 +2450,14 @@ def _handle_usecols(self, columns, usecols_key): | |||
raise ValueError("If using multiple headers, usecols must " | |||
"be integers.") | |||
col_indices = [] | |||
|
|||
missing = [c for c in self.usecols if c not in usecols_key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure what design patterns Pandas follows for this kind of thing, but would you rather me drop this down into a try/catch for col_indices.append(usecols_key.index(col))
where the error is currently being thrown?
I worry this initial approach adds needless computation time - but unsure if it's more readable? Let me know 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate further on that logic you provided: col_indices.append(usecols_key.index(col))
. I'm not sure I fully follow where / how this would be added.
Secondly, don't worry about computation time. Get a working implementation first. Then we'll worry about optimizing, if need be. Chances are that won't be an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, at the moment the error is being raised a few lines further down when we're looking for the index of the column in the usecols_key
list.
The alternate proposal would be something like:
try:
col_indices.append(usecols_key.index(col))
except ValueError:
# Calculate missing usecols and raise error accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, understood. Yes, absolutely, let's do try-except
.
pandas/tests/io/parser/usecols.py
Outdated
@@ -492,11 +492,11 @@ def test_raise_on_usecols_names_mismatch(self): | |||
tm.assert_frame_equal(df, expected) | |||
|
|||
usecols = ['a', 'b', 'c', 'f'] | |||
with tm.assert_raises_regex(ValueError, msg): | |||
with tm.assert_raises_regex(ValueError, msg.format(missing=['f'])): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're repeating the same format for every scenario here, which seems silly, but if we were to extend this in the future it'd be useful to have this around (although trivial to implement if not) - would you rather me just have a string to check against and not bother formatting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the error I'm getting during testing is super confusing:
E AssertionError: "Usecols do not match columns, columns expected but not found: ['f']" does not match "Usecols do not match columns, columns expected but not found: ['f']"
Am I being silly? Is it to do with the regex matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Let's see if we do the string formatting. We need to check that it raises on the right columns.
- Yes, it's regex matching. So you'll need to add forward-slashes to the brackets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry on the error regex matching, that's me being silly, thanks!
pandas/tests/io/parser/usecols.py
Outdated
@@ -520,9 +520,9 @@ def test_raise_on_usecols_names_mismatch(self): | |||
# tm.assert_frame_equal(df, expected) | |||
|
|||
usecols = ['A', 'B', 'C', 'f'] | |||
with tm.assert_raises_regex(ValueError, msg): | |||
with tm.assert_ra(ValueError, msg.format(missing=['f'])): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...what happened here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops! Thank you!
pandas/io/parsers.py
Outdated
def _validate_usecols(usecols, names): | ||
""" | ||
Validates that all usecols are present in a given | ||
list of names, if not, raise a ValueError that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: let's remove the comma splice in this doc-string i.e.
"...names. If not, raise a ValueError that..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, let's add a parameter description + return usecols
(just in case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done & done!
I've used the doc format of the existing functions that were below, if this needs to change let me know and I'll do so.
pandas/io/parsers.py
Outdated
|
||
Raises | ||
------ | ||
ValueError : Detailing which usecols are missing, if any. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would word this slightly to say "Columns were missing. Error message will list them."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! @jreback thoughts?
|
||
Returns | ||
------- | ||
usecols : iterable of usecols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a validate_usecols_arg function, is there a reason you are creating a new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, was going from @gfyoung's suggestion.
We could extend the existing function, at the moment it's only taking usecols
as an arg so would be extending it's arguments as well as logic.
I've not checked if every call to validate_usecols_arg
would have a names
argument to pass through, so may need to default it? Although I would think we'd always have column names to check against, right?
Let me know if this is a 'must-have' for you and I'll implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it certainly can be an optional argument
having 2 functions do similar things is confusing
Hey @gfyoung / @jreback, sorry for the delay. I'm just looking at how I'd refactor these two functions (to refresh memory, that's the current Some of the stuff in the CParserWrapper relies on the output of I'm really happy to do this if it's a must have for this to be accepted, but I'm afraid I'd need a little guidance on where/how to move things so I don't shoot in the dark and make things worse! Thanks! |
@AaronCritchley : The integration isn't so bad. Just do the following per @jreback suggestion def _validate_usecols_arg(usecols, names=None):
if names is None:
# use original logic in _validate_usecols_arg
else:
# use your logic in _validate_usecols |
Oh, so just two different branches of logic in that function? It feels a little weird that in some cases it would be returning two values, and in others returning none. If it's a design pattern you're happy for me to take then I can implement though |
Python is not a declarative language, so we can do whatever we want. 😄 Write it out, and we can take a look afterwards. |
pandas/io/parsers.py
Outdated
Parameters | ||
---------- | ||
usecols : array-like, callable, or None | ||
List of columns to use when parsing or a callable that can be used | ||
to filter a list of table columns. | ||
names: iterable, default None | ||
Iterable of names to check usecols against. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a raises section
pandas/io/parsers.py
Outdated
return set(usecols), usecols_dtype | ||
return usecols, None | ||
else: | ||
missing = [c for c in usecols if c not in names] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason if names is not None you wouldn’t also do the usecols logic?
you can’t check names if usecols is a callable; so maybe this check needs to happen separately (and after) the original
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the names that we're passing through actually depend on the result from the first call to _validate_usecols_arg
, so we'd have already executed that code at least once.
The logic between the first and second call differs in the C and Python parsers which is why I was concerned about getting it into one neat function, there are a few steps between before we're checking usecols against names.
I can run the original code everytime we call the function if that's preferable though? Up to you! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes revert to the original and give it a new name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you OK with _check_usecols_in_names
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe _validate_usecols_names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, let me know if further changes are needed.
Thanks for the help!
0d013c5
to
8a06cee
Compare
Hey @jreback, would you like any further changes or is this good to go? |
Ping @jreback - happy to make more changes if needed, would be great to get this in to master 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add a whatsnew note in 0.22. in other enhancements.
@@ -492,11 +492,11 @@ def test_raise_on_usecols_names_mismatch(self): | |||
tm.assert_frame_equal(df, expected) | |||
|
|||
usecols = ['a', 'b', 'c', 'f'] | |||
with tm.assert_raises_regex(ValueError, msg): | |||
with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test where you have 2 missing cols?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, will do this and add the whatsnew entry tomorrow, thanks!
if len(missing) > 0: | ||
raise ValueError( | ||
"Usecols do not match columns, " | ||
"columns expected but not found: {missing}".format(missing=missing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might need to ', '.join(missing) here, not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python 3.6.3 |Anaconda, Inc.| (default, Nov 8 2017, 18:10:31)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> l = [1, 'foo', [2,3]]
>>> 'Formatted l: {}'.format(l)
"Formatted l: [1, 'foo', [2, 3]]"
if you'd prefer a different error message, I'd be happy to use join though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that’s fine i guess format is pretty smart about this ok!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wording change. push and ping on green.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -76,6 +76,7 @@ Other Enhancements | |||
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`) | |||
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`) | |||
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`) | |||
- Improved wording of ``ValueError`` raised when creating a DataFrame and the ``usecols`` argument cannot match all columns. (:issue:`17301`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say this is for :func:`read_csv`
, not DataFrame construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, will change now!
@jreback Green 😄 |
thanks @AaronCritchley |
I think these regexes need a
https://travis-ci.org/pandas-dev/pandas/jobs/319713338 there are a bunch more as well. |
Will look into this over the weekend, ty for the heads up! |
GH17301: Improving the error message given when usecols
doesn't match with the columns provided.
usecols
cannot match all columns #17301git diff upstream/master -u -- "*.py" | flake8 --diff
NOTE: Do I need to add a whatsnew entry for something so small?
Happy to do so if needed.