ENH: Better error message if usecols doesn't match columns #17310

Merged

AaronCritchley
Contributor

@AaronCritchley AaronCritchley commented Aug 22, 2017

GH17301: Improving the error message given when usecols
doesn't match with the columns provided.

NOTE: Do I need to add a whatsnew entry for something so small?
Happy to do so if needed.

GH17301: Improving the error message given when usecols
doesn't match with the columns provided.

Signed-off-by: Aaron Critchley <aaron.critchley@gmail.com>
@gfyoung gfyoung added the Error Reporting Incorrect or improved errors from pandas label Aug 22, 2017

 if len(self.names) > len(usecols):
     self.names = [n for i, n in enumerate(self.names)
                   if (i in usecols or n in usecols)]

 if len(self.names) < len(usecols):
-    raise ValueError("Usecols do not match names.")
+    missing = [c for c in usecols if c not in self.orig_names]
Member

In this case, it should be checking for membership in self.names.

@@ -481,7 +481,7 @@ def test_raise_on_usecols_names_mismatch(self):
 data = 'a,b,c,d\n1,2,3,4\n5,6,7,8'

 if self.engine == 'c':
-    msg = 'Usecols do not match names'
+    msg = "do not match columns, columns expected but not found"
Member

I think it would be good to have a stronger check for which columns are listed (in both the self.orig_names and self.names cases).

Member

Also, is this non-helpful error an issue when you pass in engine='python' ? We would like to make sure that both engines are equally expressive about these types of errors.

Contributor Author

Agree on the test, it'll require a minor reformat of the structure of the test - if that's not an issue happy to go ahead (unless wildcarding the regex for the arguments is sufficient).

Will see if it's also an issue on the Python engine and if so will implement a fix!

Member

I would prefer that you explicitly put in the string which elements are missing. If that's what you're going for, then go for it!

Contributor Author

@gfyoung I've just tested with the Python engine, and the error is:

ValueError: ' column3' is not in list

Would we want the error to be consistent between both engines? If yes, would we prefer the proposed or existing format?

Member

Your error message is better. Let's use it for both engines.
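For illustration only: the Python-engine message quoted above has the shape of the error that `list.index` raises, so the sketch below assumes that is where it comes from. It contrasts that lookup failure, which names a single column, with the proposed message, which collects every missing column first. The `columns` and `usecols` values are made up for the example.

```python
columns = ["column1", "column2"]
usecols = ["column1", " column3"]  # note the stray leading space

# Assumed old behaviour: the index lookup itself fails and the error
# only mentions the first column that could not be found.
try:
    [columns.index(c) for c in usecols]
    lookup_error = None
except ValueError as err:
    lookup_error = str(err)

# Proposed behaviour: gather all missing columns, then report them together.
missing = [c for c in usecols if c not in columns]
message = (
    "Usecols do not match columns, "
    "columns expected but not found: {missing}".format(missing=missing)
)
print(message)
```

Using the same message in both engines means a user sees the full list of unmatched names regardless of which parser handled the file.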

    missing = [c for c in usecols if c not in self.orig_names]
    raise ValueError(
        "Usecols do not match columns, "
        "columns expected but not found: %s" % missing
Contributor

use .format(...)

Contributor Author

Sure, should we be using .format everywhere?

If yes, there are a lot of places where we don't currently do that; I'd be happy to start changing some of those over too (in a separate PR).

Contributor

#16130 (it's being worked on in chunks; feel free to pitch in)

    raise ValueError(
        "Usecols do not match columns, "
        "columns expected but not found: %s" % missing
    )
Contributor

same

@codecov

codecov bot commented Oct 4, 2017

Codecov Report

Merging #17310 into master will decrease coverage by 0.01%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master   #17310      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.02%     
==========================================
  Files         162      162              
  Lines       49558    49560       +2     
==========================================
- Hits        45105    45097       -8     
- Misses       4453     4463      +10
Flag        Coverage Δ
#multiple   88.77% <50%> (-0.01%) ⬇️
#single     40.25% <0%> (-0.07%) ⬇️

Impacted Files          Coverage Δ
pandas/io/parsers.py    95.39% <50%> (-0.06%) ⬇️
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/frame.py    97.72% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2bec750...15d4786.

@codecov

codecov bot commented Oct 4, 2017

Codecov Report

Merging #17310 into master will decrease coverage by 0.01%.
The diff coverage is 81.81%.

@@            Coverage Diff             @@
##           master   #17310      +/-   ##
==========================================
- Coverage   91.46%   91.44%   -0.02%     
==========================================
  Files         157      157              
  Lines       51447    51455       +8     
==========================================
- Hits        47055    47053       -2     
- Misses       4392     4402      +10
Flag        Coverage Δ
#multiple   89.31% <81.81%> (-0.01%) ⬇️
#single     40.57% <9.09%> (-0.12%) ⬇️

Impacted Files          Coverage Δ
pandas/io/parsers.py    95.55% <81.81%> (-0.05%) ⬇️
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/frame.py    97.81% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d163de7...5dfccdb.

    missing = [c for c in usecols if c not in self.orig_names]
    raise ValueError(
        "Usecols do not match columns, "
        "columns expected but not found: {}".format(missing)
Member

@gfyoung gfyoung Oct 4, 2017

Let's use keyword arguments in the string formatting. Same for your other error messages.

Member

Also, I see a lot of duplicate code here. Let's abstract into a method (you can just create a private method outside of the class). That will make your life easier.
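A minimal sketch of what such a shared private helper could look like, so that both engines call one function instead of repeating the list comprehension and the `raise`. The name `check_usecols_in_names` is hypothetical and chosen only for this example; it is not the name used in pandas.

```python
def check_usecols_in_names(usecols, names):
    """
    Hypothetical helper: raise if any entry of ``usecols`` is absent
    from ``names``; otherwise hand ``usecols`` back unchanged.
    """
    missing = [c for c in usecols if c not in names]
    if missing:
        raise ValueError(
            "Usecols do not match columns, "
            "columns expected but not found: {missing}".format(missing=missing)
        )
    return usecols


# Every requested column is present, so usecols comes back as-is.
print(check_usecols_in_names(["a", "b"], ["a", "b", "c"]))
```

Keeping the message in one place also means the tests for both engines can share a single expected string.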

@@ -2442,6 +2450,14 @@ def _handle_usecols(self, columns, usecols_key):
raise ValueError("If using multiple headers, usecols must "
"be integers.")
col_indices = []

missing = [c for c in self.usecols if c not in usecols_key]
Contributor Author

I'm unsure what design patterns Pandas follows for this kind of thing, but would you rather I move this down into a try/except around col_indices.append(usecols_key.index(col)), where the error is currently being raised?

I worry this initial approach adds needless computation time - but unsure if it's more readable? Let me know 😄

Member

Could you elaborate further on that logic you provided: col_indices.append(usecols_key.index(col)). I'm not sure I fully follow where / how this would be added.

Secondly, don't worry about computation time. Get a working implementation first. Then we'll worry about optimizing, if need be. Chances are that won't be an issue.

Contributor Author

Sure, at the moment the error is being raised a few lines further down when we're looking for the index of the column in the usecols_key list.

The alternate proposal would be something like:

try:
    col_indices.append(usecols_key.index(col))
except ValueError:
    # Calculate missing usecols and raise error accordingly

Member

Ah, understood. Yes, absolutely, let's do try-except.
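Spelled out, the try-except variant agreed on here might look roughly like the following. This is a standalone sketch, not the merged code: `column_indices` is a hypothetical free function standing in for the loop inside `_handle_usecols`, and the missing columns are only computed on the failure path.

```python
def column_indices(usecols, usecols_key):
    """Hypothetical stand-in for the index lookup loop in _handle_usecols."""
    col_indices = []
    for col in usecols:
        try:
            col_indices.append(usecols_key.index(col))
        except ValueError:
            # Only pay for the full scan once a lookup has already failed.
            missing = [c for c in usecols if c not in usecols_key]
            raise ValueError(
                "Usecols do not match columns, "
                "columns expected but not found: {missing}".format(
                    missing=missing
                )
            )
    return col_indices


print(column_indices(["a", "c"], ["a", "b", "c"]))
```

The happy path does no extra work, which addresses the concern about needless computation raised earlier in the thread.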

@@ -492,11 +492,11 @@ def test_raise_on_usecols_names_mismatch(self):
 tm.assert_frame_equal(df, expected)

 usecols = ['a', 'b', 'c', 'f']
-with tm.assert_raises_regex(ValueError, msg):
+with tm.assert_raises_regex(ValueError, msg.format(missing=['f'])):
Contributor Author

So we're repeating the same format for every scenario here, which seems silly, but if we were to extend this in the future it'd be useful to have this around (although trivial to implement if not) - would you rather me just have a string to check against and not bother formatting?

Contributor Author

Also, the error I'm getting during testing is super confusing:

E           AssertionError: "Usecols do not match columns, columns expected but not found: ['f']" does not match "Usecols do not match columns, columns expected but not found: ['f']"

Am I being silly? Is it to do with the regex matching?

Member

  • Let's do the string formatting. We need to check that it raises on the right columns.
  • Yes, it's regex matching. The square brackets are regex metacharacters, so you'll need to escape them with backslashes.

Contributor Author

Sorry on the error regex matching, that's me being silly, thanks!
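The confusing "does not match" failure can be reproduced with the standard `re` module alone. The sketch below is independent of the pandas test helpers; it only shows why a message containing `['f']` fails to match itself when used as a pattern, and two ways to escape it.

```python
import re

raised = "Usecols do not match columns, columns expected but not found: ['f']"

# Used unescaped as a pattern, "['f']" is a character class that matches a
# single "'" or "f", so the pattern fails against its own text.
unescaped_matches = re.search(raised, raised) is not None

# Escaping the brackets by hand (raw string, so the backslashes survive).
by_hand = r"columns expected but not found: \['f'\]"
by_hand_matches = re.search(by_hand, raised) is not None

# Or letting re.escape handle every metacharacter in the message.
escaped_matches = re.search(re.escape(raised), raised) is not None

print(unescaped_matches, by_hand_matches, escaped_matches)
```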

@@ -520,9 +520,9 @@ def test_raise_on_usecols_names_mismatch(self):
 # tm.assert_frame_equal(df, expected)

 usecols = ['A', 'B', 'C', 'f']
-with tm.assert_raises_regex(ValueError, msg):
+with tm.assert_ra(ValueError, msg.format(missing=['f'])):
Member

Hmm...what happened here?

Contributor Author

Whoops! Thank you!

@AaronCritchley
Contributor Author

@jreback - I've addressed your requested changes, anything else that's needed?
@gfyoung - Thanks for all of your help, please let me know if you need anything else. 😄

def _validate_usecols(usecols, names):
    """
    Validates that all usecols are present in a given
    list of names, if not, raise a ValueError that
Member

Nit: let's remove the comma splice in this doc-string i.e.

"...names. If not, raise a ValueError that..."

Member

Also, let's add a parameter description + return usecols (just in case).

Contributor Author

Done & done!

I've used the doc format of the existing functions that were below, if this needs to change let me know and I'll do so.


Raises
------
ValueError : Detailing which usecols are missing, if any.
Member

I would reword this slightly to say "Columns were missing. Error message will list them."

Contributor Author

Done!

Member

@gfyoung gfyoung left a comment

Very nice! @jreback thoughts?


Returns
-------
usecols : iterable of usecols
Contributor

there is a validate_usecols_arg function, is there a reason you are creating a new one?

Contributor Author

Sure, was going from @gfyoung's suggestion.
We could extend the existing function; at the moment it only takes usecols as an arg, so we would be extending its arguments as well as its logic.
I've not checked if every call to validate_usecols_arg would have a names argument to pass through, so may need to default it? Although I would think we'd always have column names to check against, right?

Let me know if this is a 'must-have' for you and I'll implement.

Contributor

it certainly can be an optional argument
having 2 functions do similar things is confusing

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Oct 12, 2017
@AaronCritchley
Contributor Author

Hey @gfyoung / @jreback, sorry for the delay.

I'm just looking at how I'd refactor these two functions into one (to refresh memory, that's the current validate_usecols_arg and the new _validate_usecols) and can't figure out a way to do it that wouldn't result in a mess.

Some of the stuff in the CParserWrapper relies on the output of validate_usecols_arg to work, while in the PythonParser I don't think we have enough context at the point where we call it to be able to drop the new logic in (all the logic in _infer_columns would need to be moved around, I think?).

I'm really happy to do this if it's a must have for this to be accepted, but I'm afraid I'd need a little guidance on where/how to move things so I don't shoot in the dark and make things worse!

Thanks!

@gfyoung
Member

gfyoung commented Nov 8, 2017

@AaronCritchley : The integration isn't so bad. Just do the following, per @jreback's suggestion:

def _validate_usecols_arg(usecols, names=None):
    if names is None:
        # use original logic in _validate_usecols_arg
        pass
    else:
        # use your logic in _validate_usecols
        pass

@AaronCritchley
Contributor Author

Oh, so just two different branches of logic in that function? It feels a little weird that in some cases it would be returning two values, and in others returning None. If it's a design pattern you're happy for me to take, then I can implement it.

@gfyoung
Member

gfyoung commented Nov 8, 2017

Python is not a declarative language, so we can do whatever we want. 😄

Write it out, and we can take a look afterwards.

Parameters
----------
usecols : array-like, callable, or None
    List of columns to use when parsing or a callable that can be used
    to filter a list of table columns.
names: iterable, default None
    Iterable of names to check usecols against.
Contributor

add a raises section

return set(usecols), usecols_dtype
return usecols, None
else:
missing = [c for c in usecols if c not in names]
Contributor

any reason if names is not None you wouldn’t also do the usecols logic?
you can’t check names if usecols is a callable; so maybe this check needs to happen separately (and after) the original
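To make the callable case concrete: a membership check against `names` only makes sense when `usecols` is a collection of labels or positions. The sketch below uses a hypothetical `find_missing_usecols` helper (not a pandas function) that simply skips the check for `None` and for callables, which is one possible reading of "happen separately".

```python
def find_missing_usecols(usecols, names):
    """Hypothetical helper: list usecols entries that are absent from names."""
    if usecols is None or callable(usecols):
        # A callable is a filter over column names, so there is no fixed
        # list of requested columns to compare against.
        return []
    return [c for c in usecols if c not in names]


names = ["a", "b", "c"]
print(find_missing_usecols(["a", "f"], names))
print(find_missing_usecols(lambda col: col.startswith("a"), names))
```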

Contributor Author

Yeah, the names that we're passing through actually depend on the result from the first call to _validate_usecols_arg, so we'd have already executed that code at least once.

The logic between the first and second call differs in the C and Python parsers which is why I was concerned about getting it into one neat function, there are a few steps between before we're checking usecols against names.

I can run the original code every time we call the function if that's preferable though? Up to you! 😄

Contributor

yes revert to the original and give it a new name

Contributor Author

@AaronCritchley AaronCritchley Nov 8, 2017

Are you OK with _check_usecols_in_names?

Contributor

maybe _validate_usecols_names

Contributor Author

Done, let me know if further changes are needed.

Thanks for the help!

@AaronCritchley AaronCritchley force-pushed the enhancement/better-usecols-message branch from 0d013c5 to 8a06cee Compare November 8, 2017 23:14
@AaronCritchley
Contributor Author

Hey @jreback, would you like any further changes or is this good to go?

@AaronCritchley
Contributor Author

Ping @jreback - happy to make more changes if needed, would be great to get this into master 😄

@jreback jreback removed this from the Next Major Release milestone Nov 29, 2017
Contributor

@jreback jreback left a comment

you can add a whatsnew note in 0.22, under Other Enhancements.

@@ -492,11 +492,11 @@ def test_raise_on_usecols_names_mismatch(self):
 tm.assert_frame_equal(df, expected)

 usecols = ['a', 'b', 'c', 'f']
-with tm.assert_raises_regex(ValueError, msg):
+with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):
Contributor

can you add a test where you have 2 missing cols?

Contributor Author

Yep, will do this and add the whatsnew entry tomorrow, thanks!

    if len(missing) > 0:
        raise ValueError(
            "Usecols do not match columns, "
            "columns expected but not found: {missing}".format(missing=missing)
Contributor

you might need to ', '.join(missing) here, not sure.

Contributor Author

Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> l = [1, 'foo', [2,3]]
>>> 'Formatted l: {}'.format(l)
"Formatted l: [1, 'foo', [2, 3]]"

if you'd prefer a different error message, I'd be happy to use join though.

Contributor

that's fine, i guess format is pretty smart about this. ok!
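For comparison, here is what the two options produce for the same list of missing columns. The template string mirrors the tail of the error message; the rest is plain Python and makes no assumption about pandas internals.

```python
missing = ["f", "g"]
template = "columns expected but not found: {missing}"

# Formatting the list directly uses its repr, quotes and brackets included.
as_list = template.format(missing=missing)

# Joining gives a flatter message without quotes or brackets.
as_joined = template.format(missing=", ".join(missing))

# str.join only accepts strings, so integer usecols would need converting.
try:
    ", ".join([0, 3])
    join_accepts_ints = True
except TypeError:
    join_accepts_ints = False

print(as_list)
print(as_joined)
```

Formatting the list directly keeps the quotes, which makes stray whitespace in a column name visible, and it works for integer positions without extra conversion.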

Contributor

@jreback jreback left a comment

wording change. push and ping on green.

@@ -76,6 +76,7 @@ Other Enhancements
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`)
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`)
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
- Improved wording of ``ValueError`` raised when creating a DataFrame and the ``usecols`` argument cannot match all columns. (:issue:`17301`)
Contributor

say this is for :func:`read_csv` , not DataFrame construction.

Contributor Author

Got it, will change now!

@jreback jreback added this to the 0.22.0 milestone Dec 2, 2017
@jreback jreback added the IO CSV read_csv, to_csv label Dec 2, 2017
@AaronCritchley
Contributor Author

@jreback Green 😄

@jreback jreback merged commit 7a3f81a into pandas-dev:master Dec 3, 2017
@jreback
Contributor

jreback commented Dec 3, 2017

thanks @AaronCritchley

@AaronCritchley AaronCritchley deleted the enhancement/better-usecols-message branch December 3, 2017 16:15
@jreback
Contributor

jreback commented Dec 21, 2017

I think these regexes need an r prefix to make them raw; these warnings show up on 3.6

DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:499: DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:504: DeprecationWarning: invalid escape sequence \[
  ValueError, msg.format(missing="\[('f', 'g'|'g', 'f')\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:528: DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:532: DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):

https://travis-ci.org/pandas-dev/pandas/jobs/319713338

there are a bunch more as well.

@gfyoung
cc @AaronCritchley

@AaronCritchley
Contributor Author

Will look into this over the weekend, ty for the heads up!
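A sketch of the raw-string fix for the patterns listed in the warnings, using only the standard `re` module rather than the pandas test helpers. In a normal string literal `\[` is an invalid escape sequence, which is what triggers the warning; with the `r` prefix the backslash reaches the regex engine untouched. The `matches` helper is hypothetical and exists only for this example.

```python
import re

msg = "Usecols do not match columns, columns expected but not found: {missing}"

# Raw strings keep the backslashes literal, so no escape-sequence warning.
one_missing = msg.format(missing=r"\['f'\]")
# The alternation accepts the two missing columns in either order.
two_missing = msg.format(missing=r"\[('f', 'g'|'g', 'f')\]")


def matches(pattern, missing):
    """Hypothetical helper: does pattern match the message for `missing`?"""
    raised = msg.format(missing=missing)
    return re.search(pattern, raised) is not None


print(matches(one_missing, ["f"]))
print(matches(two_missing, ["g", "f"]))
```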

Labels
Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Development

Successfully merging this pull request may close these issues.

Improve the error message when usecols cannot match all columns
4 participants