read_excel return empty dataframe when using usecols #18273
Comments
Guessing this is due to a conflict between the two kinds of spec we accept for usecols: column labels, or Excel column letters (A, B, C, ...). A workaround is to select the Excel column as a complete range (usecols='B:B'), as in the last call here:
df = pd.DataFrame({'A': [1, 3], 'B': [3, 4]})
df.to_excel('tmp.xlsx', index=False)
pd.read_excel('tmp.xlsx', usecols=['B'])
Out[86]:
Empty DataFrame
Columns: []
Index: []
pd.read_excel('tmp.xlsx', usecols='B:B')
Out[88]:
   B
0  3
1  4
I'm interested in contributing and thought this looked like a good place to take a first step. Reviewing pandas/io/excel.py, it looks like the change needs to be made in the _should_parse function of the ExcelFile class. Specifically, here: Lines 355 to 360 in 1915ffc
It looks like the current implementation checks for an integer first (i.e. a max number of columns to use), a string second (i.e. assuming a comma-separated list of column names in a single string), and assumes a list (technically any container that implements the "in" operator) otherwise. When a list is assumed for usecols, the check for the column index (i) assumes that it is a list of integers. The simplest way to implement the requested functionality would be to add a new conditional to check whether the first element is a string and, if so, concatenate the list into a single string like case 2 and re-use the _range2cols function to convert to numeric values before returning the comparison:
if isinstance(usecols, list) and isinstance(usecols[0], compat.string_types):
    return i in _range2cols(', '.join(usecols))
Additionally, there would probably need to be another check (if there isn't already?) to handle the behavior for an empty list. Should this be assumed to mean the same thing as None? It doesn't make sense to read a sheet and not return any data. If we take this a little further, we could add support for mixed lists of integers and strings (if desired) by doing something like:
def _list2cols(area_list):
    parsed_list = list()
    for e in area_list:
        if isinstance(e, int):
            parsed_list.append(e)
        elif isinstance(e, compat.string_types):
            parsed_list.extend(_range2cols(e))
        else:
            pass  # Assuming other types should not be considered
    return parsed_list

if isinstance(usecols, list):
    return i in _list2cols(usecols)
Interested in feedback on which direction should be taken.
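For anyone who wants to try the mixed-list idea outside of pandas, here is a minimal standalone sketch. It assumes nothing about the real pandas internals: excel2num and range2cols are hypothetical stand-ins written for this example (they are not the private _range2cols helper in pandas/io/excel.py), and plain str replaces compat.string_types.

```python
def excel2num(letters):
    """Convert Excel column letters ('A', 'AB') to a 0-based index.

    Hypothetical stand-in written for this sketch, not pandas code.
    """
    letters = letters.strip().upper()
    if not letters:
        raise ValueError("empty column name")
    index = 0
    for ch in letters:
        if not "A" <= ch <= "Z":
            raise ValueError("invalid Excel column name: %r" % letters)
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def range2cols(areas):
    """Expand a string such as 'A, C:E' into a list of 0-based indices."""
    cols = []
    for area in areas.split(","):
        if ":" in area:
            start, end = area.split(":")
            cols.extend(range(excel2num(start), excel2num(end) + 1))
        else:
            cols.append(excel2num(area))
    return cols


def list2cols(area_list):
    """Accept a mixed list of integers and Excel column strings."""
    parsed = []
    for item in area_list:
        if isinstance(item, int):
            parsed.append(item)
        elif isinstance(item, str):
            parsed.extend(range2cols(item))
        # Other types are silently ignored, as in the proposal above.
    return parsed


print(list2cols([0, "B", "D:F"]))  # [0, 1, 3, 4, 5]
```

Note that this treats every string as Excel letters, so it cannot select a column by its header label; that limitation is what the rest of the thread goes on to discuss.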
What I think would be easiest here would be to only have Line 529 in 1915ffc
It already has logic to handle column names/locations, and will raise in the mixed case.
from pandas.io.parsers import TextParser
TextParser([['a', 'b', 'c'],
            [1, 2, 3]], usecols=['b']).read()
Out[81]:
   b
0  2
Ok. So in that case, you would remove the preprocessing steps here. Line 478 in 1915ffc
Then, you would call a function like the current _should_parse to convert the usecols value into something readable by TextParser and pass that new value where you referenced. That conversion would only have to be done once, versus the rows x columns number of times that it is currently done. Is that the correct interpretation?
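To make the "convert once, then select" idea concrete, here is a rough sketch in plain Python. resolve_usecols is a hypothetical helper invented for illustration; it is not the pandas implementation and only mimics the behaviour described above (labels, positions, or a callable, with an error for mixed lists).

```python
def resolve_usecols(header, usecols):
    """Return the sorted 0-based column positions selected by usecols.

    Hypothetical helper for illustration; not part of pandas.
    """
    if usecols is None:
        return list(range(len(header)))
    if callable(usecols):
        return [i for i, name in enumerate(header) if usecols(name)]
    usecols = list(usecols)
    if all(isinstance(c, int) for c in usecols):
        positions = usecols
    elif all(isinstance(c, str) for c in usecols):
        missing = [c for c in usecols if c not in header]
        if missing:
            raise ValueError("usecols labels not found in header: %r" % missing)
        positions = [header.index(c) for c in usecols]
    else:
        raise ValueError("usecols must be all integers or all strings")
    return sorted(set(positions))


rows = [["a", "b", "c"], [1, 2, 3], [4, 5, 6]]
header, data = rows[0], rows[1:]

# Resolved once per sheet instead of once per cell.
keep = resolve_usecols(header, ["b"])
selected = [[row[i] for i in keep] for row in data]
print(selected)  # [[2], [5]]
```

The point of the design is that the per-cell check disappears: the positions are computed a single time from the header and then reused for every row.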
Yes, I'm thinking that will be most straightforward, rather than
re-implementing what TextParser does. Feel free to put up a
work-in-progress pull request and ping me; it is often easier to answer
questions looking at the actual changes.
Any progress on this? I couldn't find the mentioned pull request.
Wow, after 2 years of using this, I stumbled upon this after I updated my pandas. This is scary since most of my scripts use the usecols parameter heavily. I guess I have to downgrade for now.
Yeah, the same issue with 0.21.0 and 0.22.0. No issues with 0.20.3. All columns' names are Cyrillic complex (not
I have a set of Excel files, all of which have a few of the same columns I need to read, but the other columns are different. The old usecols with specified column names works nicely. Do I have to downgrade pandas? Currently I have 0.22.0.
Same issue here. usecols is returning an empty dataframe after upgrading.
Same issue here. usecols is returning an empty dataframe with 0.22.0.
Same issue: an empty dataframe is returned with 0.22.0.
For me
I'm going to work on this issue to solve it. If anyone is already working on it, please tell me.
- closes pandas-dev#18273
- tests added/passed
- passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- whatsnew entry
As mentioned, read_excel returns an empty DataFrame when the usecols argument is a list of strings. Now lists of strings are correctly interpreted by the read_excel function.
capability of passing column labels for columns to be read
- [x] closes pandas-dev#18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry
Created 'usecols_excel', which receives a string containing comma-separated Excel ranges and columns. Changed the 'usecols' named argument of the 'read_excel' function: it now receives a list of strings containing column labels, a list of integers representing column indexes, or a callable. Created and altered tests to reflect the new usage of these named arguments. The 'index_col' keyword used to indicate which columns in the subset selected by 'usecols' or 'usecols_excel' should be the index of the DataFrame read. Now 'index_col' indicates which columns of the DataFrame will be the index even if that column is not in the subset of the selected columns.
Thanks for resolving the read_excel return empty dataframe.
Throwing my two cents in: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html Supporting lists of strings is not technically addressed in the documentation, so I'm a little hesitant to call this a bug as of the current version of That being said, this issue does bring up a lot of questions re: how to handle
What do you guys think?
My personal belief is that usecols should operate the same way that it does in read_csv, and Excel-specific behavior should be handled in a separate parameter (something to the effect of It would require some deprecations from the current state but I think that logical separation would clarify any ambiguity between range "A:A" in Excel and a column named "A".
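A tiny illustration of the ambiguity being described, as a sketch only. Both helper functions are hypothetical names made up for this example and do not exist in pandas; the sheet is assumed to have its first column labelled "B" and its second labelled "A".

```python
def position_from_label(header, label):
    """Interpret the string as a header label (read_csv-style)."""
    return header.index(label)


def position_from_excel_letters(letters):
    """Interpret the string as Excel column letters ('A' is the first column)."""
    position = 0
    for ch in letters.upper():
        position = position * 26 + (ord(ch) - ord("A") + 1)
    return position - 1


header = ["B", "A"]  # first column is labelled "B", second is labelled "A"

print(position_from_label(header, "A"))    # 1
print(position_from_excel_letters("A"))    # 0
```

The same string selects different columns depending on which interpretation is applied, which is the argument for keeping the two kinds of spec apart.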
@WillAyd : Not sure I fully agree with adding a new parameter, but at least we have consensus that
The idea is that we read the Excel file, get the data, and then let the TextParser handle the reading and parsing. We shouldn't be doing a lot of work that is already defined in parsers.py.
In doing so, we identified several bugs:
* index_col=None was not being respected
* usecols behavior was inconsistent with that of read_csv for list of strings and callable inputs
* usecols was not being validated as proper Excel column names when passed as a string.
Closes gh-18273. Closes gh-20480.
Problem still exists in pandas==0.23.4.
this was closed in 0.24.0
Problem description
Given an Excel file named A.xlsx (or A.xls) with columns A and B,
read_excel returns an empty dataframe if usecols is used.