-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raise ValueError if usecols has same size as but doesn't exist in headers (#14671) #14674
Conversation
|
- Updated tests - Updated whatsnew 0.19.2 note - Added new parameter file_header for CParserWrapper to contain the original header read from the file for comparison
Current coverage is 85.29% (diff: 100%)@@ master #14674 diff @@
==========================================
Files 145 140 -5
Lines 51090 50723 -367
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 43315 43263 -52
+ Misses 7775 7460 -315
Partials 0 0
|
can you rebase |
1 similar comment
can you rebase |
0b15cd1
to
a985129
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I noted in the issue, this PR is certainly still welcome!
Added some comments.
@@ -32,6 +32,7 @@ Bug Fixes | |||
|
|||
|
|||
|
|||
- Bug in pd.read_csv - catch missing columns if usecols and header lengths match (:issue:`14671`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to 0.20.0.txt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.read_csv()
instead ofpd.read_csv
- Let's generalize this a little. This PR is not actually about handling when
header
andusecols
length match. It's about properly handling situations whenusecols
provides non-existent columns.
usecol_len = len(set(self.usecols) - set(h)) | ||
usecoli_len = len(set(self.usecols) - set(range(0, len(h)))) | ||
if usecol_len > 0 and usecoli_len > 0: | ||
raise ValueError("Usecols do not match names.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those code block seems a bit convoluted. Can this be simplified?
Is there a reason you need _reader.file_header
. Aren't those already in names?
Possibly this would be easier if this check is performed inside _filter_usecols
.
Then we could check at the same time for out of bounds integer usecols (eg usecols=[0,2]
if there are only 2 columns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell the way the C parser works is that it will populate the names from the names value passed into the TextReader class. So if I wanted to read columns 'A' and 'C' from a CSV but rename them as 'Category' and 'Total' (names parameter) usecols compares against the names parameter
Another example (which happens with the pandas.io.tests.parser.test_parsers.TestCParserLowMemory test):
s = """0,1,20140101,0900,4
0,1,20140102,1000,4"""
parse_dates = [[1, 2]]
names = list('acd')
df = self.read_csv(StringIO(s), names=names,
usecols=[0, 2, 3],
parse_dates=parse_dates)
Fails because index 3 doesn't exist in the names list passed in (it's only 0-2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @gfyoung
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify my concern here is how should the usecols function if the names parameter is used? The documentation states:
All elements in this array must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s)
Is that to mean that if the names parameter is passed in then usecols filters on those names or does it mean that I should be able to pass in values from either the headers or the names parameter? If it's the former then perhaps I can update the documentation to state "name (if supplied)" under the usecols argument. If it's the latter then how are to know which list to filter based on an integer index? The header or the names?
The other question is should the usecols be applied / filtered first before the names are applied to the columns?
@@ -54,6 +54,10 @@ def test_usecols(self): | |||
expected.columns = ['foo', 'bar'] | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
# same length but usecols column doesn't exist - see gh-14671 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this in a separate test (eg call it test_usecols_non_existing
)? And also add an explicit test case for the 'normal' case where the length does not match the number of columns (which already raises), and tests for both cases but using integers.
|
||
#if self.usecols is not None: | ||
# if len(set(self.usecols) - set(header[0])) > 0 and len(set(self.usecols) - set(range(0,field_count))) > 0: | ||
# raise ValueError("Usecols do not match names.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clean this up?
>>> data = 'a,b\n1,2'
>>> read_csv(StringIO(data), usecols=[0, 2], engine='python')
a
0 1 Given the scope of your patch (which applies to both out-of-bounds indices and non-existent names), you will have to also make sure the Python engine raises similarly in this situation for consistency. |
pls comment if youd like to update / rebase and continue |
fixes #14671