ENH: Better error message if usecols doesn't match columns #17310

AaronCritchley · 2017-08-22T16:03:14Z

GH17301: Improving the error message given when usecols
doesn't match with the columns provided.

closes Improve the error message when usecols cannot match all columns #17301
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

NOTE: Do I need to add a whatsnew entry for something so small?
Happy to do so if needed.

GH17301: Improving the error message given when usecols doesn't match with the columns provided. Signed-off-by: Aaron Critchley <aaron.critchley@gmail.com>

gfyoung · 2017-08-22T16:26:41Z

pandas/io/parsers.py


            if len(self.names) > len(usecols):
                self.names = [n for i, n in enumerate(self.names)
                              if (i in usecols or n in usecols)]

            if len(self.names) < len(usecols):
-                raise ValueError("Usecols do not match names.")
+                missing = [c for c in usecols if c not in self.orig_names]


In this case, it should be checking for membership in self.names.

gfyoung · 2017-08-22T16:27:20Z

pandas/tests/io/parser/usecols.py

@@ -481,7 +481,7 @@ def test_raise_on_usecols_names_mismatch(self):
        data = 'a,b,c,d\n1,2,3,4\n5,6,7,8'

        if self.engine == 'c':
-            msg = 'Usecols do not match names'
+            msg = "do not match columns, columns expected but not found"


I think it would be good to have a stronger check for which columns are listed (in both the self.orig_names and self.names cases).

Also, is this unhelpful error an issue when you pass in engine='python'? We would like to make sure that both engines are equally expressive about these types of errors.

Agree on the test, it'll require a minor reformat of the structure of the test - if that's not an issue happy to go ahead (unless wildcarding the regex for the arguments is sufficient).

Will see if it's also an issue on the Python engine and if so will implement a fix!

I would prefer that you explicitly put in the string which elements are missing. If that's what you're going for, then go for it!

@gfyoung I've just tested with the Python engine, and the error is:

ValueError: ' column3' is not in list

Would we want the error to be consistent between both engines? If yes, would we prefer the proposed or existing format?

Your error message is better. Let's use it for both engines.

jreback · 2017-08-23T12:02:20Z

pandas/io/parsers.py

+                missing = [c for c in usecols if c not in self.orig_names]
+                raise ValueError(
+                    "Usecols do not match columns, "
+                    "columns expected but not found: %s" % missing


use .format(...)

Sure, should we be using .format everywhere?

If yes, there are a lot of places where we don't currently do that. I'd be happy to start changing some of those over too (in a separate PR).

#16130 (it's being worked on in chunks; feel free to pitch in)

jreback · 2017-08-23T12:02:31Z

pandas/io/parsers.py

+                raise ValueError(
+                    "Usecols do not match columns, "
+                    "columns expected but not found: %s" % missing
+                )


codecov · 2017-10-04T19:08:15Z

Codecov Report

Merging #17310 into master will decrease coverage by 0.01%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master   #17310      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.02%     
==========================================
  Files         162      162              
  Lines       49558    49560       +2     
==========================================
- Hits        45105    45097       -8     
- Misses       4453     4463      +10

Flag	Coverage Δ
#multiple	`88.77% <50%> (-0.01%)`	⬇️
#single	`40.25% <0%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.39% <50%> (-0.06%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.72% <0%> (-0.1%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2bec750...15d4786.

codecov · 2017-10-04T19:08:24Z

Codecov Report

Merging #17310 into master will decrease coverage by 0.01%.
The diff coverage is 81.81%.

@@            Coverage Diff             @@
##           master   #17310      +/-   ##
==========================================
- Coverage   91.46%   91.44%   -0.02%     
==========================================
  Files         157      157              
  Lines       51447    51455       +8     
==========================================
- Hits        47055    47053       -2     
- Misses       4392     4402      +10

Flag	Coverage Δ
#multiple	`89.31% <81.81%> (-0.01%)`	⬇️
#single	`40.57% <9.09%> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.55% <81.81%> (-0.05%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.81% <0%> (-0.1%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d163de7...5dfccdb.

gfyoung · 2017-10-04T19:09:19Z

pandas/io/parsers.py

+                missing = [c for c in usecols if c not in self.orig_names]
+                raise ValueError(
+                    "Usecols do not match columns, "
+                    "columns expected but not found: {}".format(missing)


Let's use keyword arguments in the string formatting. Same for your other error messages.

Also, I see a lot of duplicate code here. Let's abstract into a method (you can just create a private method outside of the class). That will make your life easier.

AaronCritchley · 2017-10-04T19:10:19Z

pandas/io/parsers.py

@@ -2442,6 +2450,14 @@ def _handle_usecols(self, columns, usecols_key):
                    raise ValueError("If using multiple headers, usecols must "
                                     "be integers.")
                col_indices = []
+
+                missing = [c for c in self.usecols if c not in usecols_key]


I'm unsure what design patterns pandas follows for this kind of thing, but would you rather I move this into a try/except around col_indices.append(usecols_key.index(col)), where the error is currently being thrown?

I worry this initial approach adds needless computation time - but unsure if it's more readable? Let me know 😄

Could you elaborate further on the logic you provided: col_indices.append(usecols_key.index(col))? I'm not sure I fully follow where / how this would be added.

Secondly, don't worry about computation time. Get a working implementation first. Then we'll worry about optimizing, if need be. Chances are that won't be an issue.

Sure, at the moment the error is being raised a few lines further down when we're looking for the index of the column in the usecols_key list.

The alternate proposal would be something like:

    try:
        col_indices.append(usecols_key.index(col))
    except ValueError:
        # Calculate missing usecols and raise error accordingly

Ah, understood. Yes, absolutely, let's do try-except.
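The try/except variant agreed on here can be sketched roughly as below. This is an illustration only, not the code that was merged: `resolve_col_indices` is a hypothetical name, and the message text is copied from the diff quoted earlier in this thread.

```python
def resolve_col_indices(usecols, usecols_key):
    """Map each requested column to its position in usecols_key.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    col_indices = []
    for col in usecols:
        try:
            col_indices.append(usecols_key.index(col))
        except ValueError:
            # The full list of missing columns is only computed on failure,
            # so the common (valid) path pays no extra cost.
            missing = [c for c in usecols if c not in usecols_key]
            raise ValueError(
                "Usecols do not match columns, "
                "columns expected but not found: {missing}".format(
                    missing=missing)
            )
    return col_indices


print(resolve_col_indices(["a", "c"], ["a", "b", "c", "d"]))  # [0, 2]
```

Deferring the `missing` computation to the except branch addresses the concern about needless work on every call.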

AaronCritchley · 2017-10-04T19:40:18Z

pandas/tests/io/parser/usecols.py

@@ -492,11 +492,11 @@ def test_raise_on_usecols_names_mismatch(self):
        tm.assert_frame_equal(df, expected)

        usecols = ['a', 'b', 'c', 'f']
-        with tm.assert_raises_regex(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg.format(missing=['f'])):


So we're repeating the same format for every scenario here, which seems silly, but if we were to extend this in the future it'd be useful to have this around (although trivial to implement if not) - would you rather me just have a string to check against and not bother formatting?

Also, the error I'm getting during testing is super confusing:

E AssertionError: "Usecols do not match columns, columns expected but not found: ['f']" does not match "Usecols do not match columns, columns expected but not found: ['f']"

Am I being silly? Is it to do with the regex matching?

Let's see if we do the string formatting. We need to check that it raises on the right columns.

Yes, it's regex matching. So you'll need to add backslashes to escape the brackets.

Sorry on the error regex matching, that's me being silly, thanks!
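The confusing "X does not match X" failure can be reproduced with the standard `re` module alone. The sketch below assumes only that the test helper treats the expected message as a regular expression, as stated above; the message text is the one from the assertion error.

```python
import re

message = ("Usecols do not match columns, "
           "columns expected but not found: ['f']")

# Used directly as a pattern, "['f']" is a character class matching a
# single "'" or "f", so the literal "[" in the message never matches.
print(re.search(message, message) is None)  # True

# Escaping the brackets by hand (in a raw string) makes them literal.
template = ("Usecols do not match columns, "
            "columns expected but not found: {missing}")
pattern = template.format(missing=r"\['f'\]")
print(re.search(pattern, message) is not None)  # True

# re.escape builds an equivalent pattern without hand-written backslashes.
print(re.search(re.escape(message), message) is not None)  # True
```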

gfyoung · 2017-10-04T19:44:05Z

pandas/tests/io/parser/usecols.py

@@ -520,9 +520,9 @@ def test_raise_on_usecols_names_mismatch(self):
        # tm.assert_frame_equal(df, expected)

        usecols = ['A', 'B', 'C', 'f']
-        with tm.assert_raises_regex(ValueError, msg):
+        with tm.assert_ra(ValueError, msg.format(missing=['f'])):


Hmm...what happened here?

Whoops! Thank you!

AaronCritchley · 2017-10-04T22:41:15Z

@jreback - I've addressed your requested changes, anything else that's needed?
@gfyoung - Thanks for all of your help, please let me know if you need anything else. 😄

gfyoung · 2017-10-04T23:31:55Z

pandas/io/parsers.py

+def _validate_usecols(usecols, names):
+    """
+    Validates that all usecols are present in a given
+    list of names, if not, raise a ValueError that


Nit: let's remove the comma splice in this doc-string i.e.

"...names. If not, raise a ValueError that..."

Also, let's add a parameter description + return usecols (just in case).

Done & done!

I've used the doc format of the existing functions that were below, if this needs to change let me know and I'll do so.

gfyoung · 2017-10-04T23:48:06Z

pandas/io/parsers.py

+
+    Raises
+    ------
+    ValueError : Detailing which usecols are missing, if any.


I would reword this slightly to say "Columns were missing. Error message will list them."

gfyoung

Very nice! @jreback thoughts?

jreback · 2017-10-05T10:32:26Z

pandas/io/parsers.py

+
+    Returns
+    -------
+    usecols : iterable of usecols


there is a validate_usecols_arg function, is there a reason you are creating a new one?

Sure, was going from @gfyoung's suggestion.
We could extend the existing function. At the moment it only takes usecols as an arg, so we would be extending its arguments as well as its logic.
I've not checked if every call to validate_usecols_arg would have a names argument to pass through, so may need to default it? Although I would think we'd always have column names to check against, right?

Let me know if this is a 'must-have' for you and I'll implement.

it certainly can be an optional argument
having 2 functions do similar things is confusing

AaronCritchley · 2017-11-08T00:30:13Z

Hey @gfyoung / @jreback, sorry for the delay.

I'm just looking at how I'd refactor these two functions into one (to refresh memory, that's the current validate_usecols_arg and the new validate_usecols) and can't figure out a way to do it that wouldn't result in a mess.

Some of the stuff in the CParserWrapper relies on the output of validate_usecols_arg to work, while in the PythonParser, the point where we're calling it I don't think we have enough context to be able to drop the new logic in (all the logic in _infer_columns would need to be moved around I think?).

I'm really happy to do this if it's a must have for this to be accepted, but I'm afraid I'd need a little guidance on where/how to move things so I don't shoot in the dark and make things worse!

Thanks!

gfyoung · 2017-11-08T00:33:31Z

@AaronCritchley : The integration isn't so bad. Just do the following per @jreback suggestion

def _validate_usecols_arg(usecols, names=None):
    if names is None:
        # use original logic in _validate_usecols_arg
    else:
        # use your logic in _validate_usecols

AaronCritchley · 2017-11-08T00:48:57Z

Oh, so just two different branches of logic in that function? It feels a little weird that in some cases it would be returning two values, and in others returning none. If it's a design pattern you're happy for me to take then I can implement though

gfyoung · 2017-11-08T00:56:10Z

Python is not a declarative language, so we can do whatever we want. 😄

Write it out, and we can take a look afterwards.

jreback · 2017-11-08T02:38:12Z

pandas/io/parsers.py

    Parameters
    ----------
    usecols : array-like, callable, or None
        List of columns to use when parsing or a callable that can be used
        to filter a list of table columns.
+    names: iterable, default None
+        Iterable of names to check usecols against.


add a raises section

jreback · 2017-11-08T02:42:00Z

pandas/io/parsers.py

+            return set(usecols), usecols_dtype
+        return usecols, None
+    else:
+        missing = [c for c in usecols if c not in names]


any reason if names is not None you wouldn’t also do the usecols logic?
you can’t check names if usecols is a callable; so maybe this check needs to happen separately (and after) the original

Yeah, the names that we're passing through actually depend on the result from the first call to _validate_usecols_arg, so we'd have already executed that code at least once.

The logic between the first and second call differs in the C and Python parsers which is why I was concerned about getting it into one neat function, there are a few steps between before we're checking usecols against names.

I can run the original code every time we call the function if that's preferable though? Up to you! 😄

yes revert to the original and give it a new name

Are you OK with _check_usecols_in_names?

maybe _validate_usecols_names

Done, let me know if further changes are needed.

Thanks for the help!
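For readers following along, the standalone check settled on here might look roughly like the sketch below. It is pieced together from the diff fragments quoted in this review (the `missing` list comprehension, the keyword-formatted message, and the request to return `usecols`); the function is named without the leading underscore to signal that it is an illustration, and the merged pandas helper may differ in detail.

```python
def validate_usecols_names(usecols, names):
    """Raise ValueError if any entry of usecols is absent from names.

    Sketch reconstructed from snippets in this review; not the pandas source.
    """
    missing = [c for c in usecols if c not in names]
    if len(missing) > 0:
        raise ValueError(
            "Usecols do not match columns, "
            "columns expected but not found: {missing}".format(
                missing=missing)
        )
    return usecols


# All requested columns exist, so usecols is returned unchanged.
print(validate_usecols_names(["a", "b"], ["a", "b", "c", "d"]))  # ['a', 'b']

# A missing column is named in the error message.
try:
    validate_usecols_names(["a", "f"], ["a", "b", "c", "d"])
except ValueError as err:
    print(err)
```

Keeping this separate from `_validate_usecols_arg` means the names check can run after the original validation, which matters because a callable `usecols` cannot be checked against names.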

AaronCritchley · 2017-11-13T18:53:07Z

Hey @jreback, would you like any further changes or is this good to go?

AaronCritchley · 2017-11-28T21:52:54Z

Ping @jreback - happy to make more changes if needed, would be great to get this in to master 😄

jreback

you can add a whatsnew note in 0.22, under Other Enhancements.

jreback · 2017-11-29T00:09:25Z

pandas/tests/io/parser/usecols.py

@@ -492,11 +492,11 @@ def test_raise_on_usecols_names_mismatch(self):
        tm.assert_frame_equal(df, expected)

        usecols = ['a', 'b', 'c', 'f']
-        with tm.assert_raises_regex(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):


can you add a test where you have 2 missing cols?

Yep, will do this and add the whatsnew entry tomorrow, thanks!
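A test for two missing columns has to cope with the order in which they are reported. The sketch below assumes the order may vary (an earlier diff in this thread shows `usecols` being returned as a set) and uses an alternation so either order is accepted; the pattern mirrors the one that appears later in the CI log.

```python
import re

template = ("Usecols do not match columns, "
            "columns expected but not found: {missing}")

# Accept the two missing names in either order.
pattern = template.format(missing=r"\[('f', 'g'|'g', 'f')\]")

for missing in (["f", "g"], ["g", "f"]):
    message = template.format(missing=missing)
    assert re.search(pattern, message) is not None

# A single missing column does not satisfy the two-column pattern.
assert re.search(pattern, template.format(missing=["f"])) is None
```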

jreback · 2017-11-29T00:09:52Z

pandas/io/parsers.py

+    if len(missing) > 0:
+        raise ValueError(
+            "Usecols do not match columns, "
+            "columns expected but not found: {missing}".format(missing=missing)


you might need to ', '.join(missing) here, not sure.

Python 3.6.3 |Anaconda, Inc.| (default, Nov 8 2017, 18:10:31)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> l = [1, 'foo', [2,3]]
>>> 'Formatted l: {}'.format(l)
"Formatted l: [1, 'foo', [2, 3]]"

if you'd prefer a different error message, I'd be happy to use join though.

that’s fine i guess format is pretty smart about this ok!
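One practical difference between the two options, shown as a small sketch: `str.format` uses the list's repr regardless of element types, while `str.join` accepts only strings. The list below reuses the mixed values from the session above purely for illustration; whether non-string entries can actually reach this error path is not established in this thread.

```python
missing = [1, "foo"]  # illustrative values only

# str.format falls back to the list's repr, whatever the element types.
formatted = "not found: {missing}".format(missing=missing)
print(formatted)  # not found: [1, 'foo']

# str.join only accepts strings, so mixed input raises TypeError.
try:
    ", ".join(missing)
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)  # True

# An explicit str() conversion makes join work, at the cost of losing quotes.
joined = "not found: " + ", ".join(str(c) for c in missing)
print(joined)  # not found: 1, foo
```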

jreback

wording change. push and ping on green.

jreback · 2017-12-02T15:56:25Z

doc/source/whatsnew/v0.22.0.txt

@@ -76,6 +76,7 @@ Other Enhancements
 - Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`)
 - :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`)
 - :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
+- Improved wording of ``ValueError`` raised when creating a DataFrame and the ``usecols`` argument cannot match all columns. (:issue:`17301`)


say this is for :func:`read_csv`, not DataFrame construction.

Got it, will change now!

AaronCritchley · 2017-12-03T01:16:03Z

@jreback Green 😄

jreback · 2017-12-03T15:26:54Z

thanks @AaronCritchley

jreback · 2017-12-21T14:59:48Z

I think these regexes need an r prefix to make them raw strings; these warnings show up on 3.6

DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:499: DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:504: DeprecationWarning: invalid escape sequence \[
  ValueError, msg.format(missing="\[('f', 'g'|'g', 'f')\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:528: DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):
/home/travis/build/pandas-dev/pandas/pandas/tests/io/parser/usecols.py:532: DeprecationWarning: invalid escape sequence \[
  with tm.assert_raises_regex(ValueError, msg.format(missing="\['f'\]")):

https://travis-ci.org/pandas-dev/pandas/jobs/319713338

there are a bunch more as well.

@gfyoung
cc @AaronCritchley

AaronCritchley · 2017-12-21T17:00:00Z

Will look into this over the weekend, ty for the heads up!
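The warning arises because `\[` is not a recognised escape sequence in an ordinary string literal. A minimal sketch of the raw-string fix, using only the standard library:

```python
import re

# "\\['f'\\]" spells the backslashes explicitly; r"\['f'\]" is the same
# string written as a raw literal, which avoids the invalid-escape warning.
escaped = "\\['f'\\]"
raw = r"\['f'\]"
assert escaped == raw

# The raw pattern still matches the literal brackets in the message.
assert re.search(raw, "columns expected but not found: ['f']") is not None
```

The pattern's behaviour is unchanged; only the way the literal is written in the source differs.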

ENH: better error message if usecols doesn't match

b0e102a

gfyoung added the "Error Reporting" label (Incorrect or improved errors from pandas) Aug 22, 2017

gfyoung reviewed Aug 22, 2017

jreback requested changes Aug 23, 2017

WIP: Changing Python parser behaviour, better tests

15d4786

gfyoung reviewed Oct 4, 2017

AaronCritchley commented Oct 4, 2017

WIP: Implementing suggested changes

841a6cc

AaronCritchley commented Oct 4, 2017

gfyoung reviewed Oct 4, 2017

Fixing silly error in tests

dced1b7

gfyoung reviewed Oct 4, 2017

Better documentation as per @gfyoung

5bf89a8

gfyoung reviewed Oct 4, 2017

Changing ValueError documentation

ba93833

gfyoung approved these changes Oct 5, 2017

jreback requested changes Oct 5, 2017

TomAugspurger added this to the Next Major Release milestone Oct 12, 2017

jreback requested changes Nov 8, 2017

Reverting, changing name of function

8a06cee

AaronCritchley force-pushed the enhancement/better-usecols-message branch from 0d013c5 to 8a06cee Compare November 8, 2017 23:14

Merge branch 'master' into enhancement/better-usecols-message

1afb4c1

jreback removed this from the Next Major Release milestone Nov 29, 2017

jreback requested changes Nov 29, 2017

AaronCritchley added 2 commits December 2, 2017 15:22

Merge branch 'master' into enhancement/better-usecols-message

5a8a852

Adding whatsnews entry, and test for multiple missing columns

2209eae

jreback requested changes Dec 2, 2017

jreback added this to the 0.22.0 milestone Dec 2, 2017

jreback added the "IO CSV" label (read_csv, to_csv) Dec 2, 2017

AaronCritchley added 2 commits December 2, 2017 15:58

Adding accidentally removed test back

93185f5

Suggested whatsnew change

5dfccdb

jreback approved these changes Dec 3, 2017

jreback merged commit 7a3f81a into pandas-dev:master Dec 3, 2017

AaronCritchley deleted the enhancement/better-usecols-message branch December 3, 2017 16:15

AaronCritchley mentioned this pull request Dec 23, 2017

Fixing 3.6 Escape Sequence Deprecations in tests/io/parser/usecols.py #18918

Merged
