read_excel return empty dataframe when using usecols #18273

SenWang · 2017-11-14T02:17:09Z

In [3]: data = pd.read_excel("A.xlsx")

In [4]: data
Out[4]:
   A  B
0  1  2
1  3  4

In [5]: data1 = pd.read_excel("A.xlsx",usecols=['B'])

In [6]: data1
Out[6]:
Empty DataFrame
Columns: []
Index: []

In [7]: pd.__version__
Out[7]: '0.21.0'

Problem description

Given an Excel file named A.xlsx (or A.xls) with columns A and B,
read_excel returns an empty DataFrame if usecols is used.

chris-b1 · 2017-11-14T17:21:52Z

Guessing this is due to a conflict between the two different kinds of specs we accept for usecols: column labels, or Excel column letters (A, B, C, ...). A workaround is to select the Excel column completely ('B:B'), but this should work.

df = pd.DataFrame({'A': [1, 3], 'B': [3, 4]})
df.to_excel('tmp.xlsx', index=False)

pd.read_excel('tmp.xlsx', usecols=['B'])
Out[86]: 
Empty DataFrame
Columns: []
Index: []


pd.read_excel('tmp.xlsx', usecols='B:B')
Out[88]: 
   B
0  3
1  4

jhall6226 · 2017-11-20T03:36:54Z

I'm interested in contributing and thought this looked like a good place to take a first step.

Reviewing pandas/io/excel.py, it looks like the change needs to be made in the _should_parse function of the ExcelFile class. Specifically, here:

pandas/pandas/io/excel.py

Lines 355 to 360 in 1915ffc

    
        if isinstance(usecols, int):
            return i <= usecols
        elif isinstance(usecols, compat.string_types):
            return i in _range2cols(usecols)
        else:
            return i in usecols

It looks like the current implementation checks for an integer first (i.e. a max number of columns to use), a string second (i.e. assuming a comma separated list of column names in a single string), and assumes a list (technically any container that implements the "in" operator) otherwise. When a list is assumed for usecols, the check for the column index (i) assumes that it is a list of integers.
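
The dispatch described above can be reproduced outside pandas. The following is a minimal sketch, not the pandas source: `excel2num`, `range2cols`, and `should_parse` are illustrative stand-ins written for this example, and the real `_range2cols` may differ in detail. It shows why a list of labels such as ['B'] selects nothing: the last branch compares an integer column index against strings.

```python
def excel2num(name):
    """Convert an Excel column name such as 'A' or 'AB' to a 0-based index."""
    index = 0
    for char in name.strip().upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def range2cols(areas):
    """Expand a spec such as 'A,C:E' into a list of 0-based column indices."""
    cols = []
    for area in areas.split(","):
        if ":" in area:
            start, end = area.split(":")
            cols.extend(range(excel2num(start), excel2num(end) + 1))
        else:
            cols.append(excel2num(area))
    return cols


def should_parse(usecols, i):
    """Mirror the three branches quoted above for column index i."""
    if isinstance(usecols, int):
        return i <= usecols
    elif isinstance(usecols, str):
        return i in range2cols(usecols)
    else:
        return i in usecols  # only meaningful for a container of integers


# Column 'B' has the 0-based index 1.
print(should_parse("B:B", 1))   # True
print(should_parse([1], 1))     # True
print(should_parse(["B"], 1))   # False: the integer 1 never equals 'B'
```

Since every column index fails the last check when usecols is ['B'], every column is skipped, which matches the empty DataFrame in the report.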

The simplest way to implement the requested functionality would be to add a new conditional to check whether the first element is a string and, if so, concatenate the list into a single string like case 2 and re-use the _range2cols function to convert to numeric values before returning the comparison:

if isinstance(usecols, list) and isinstance(usecols[0], compat.string_types):
    return i in _range2cols(', '.join(usecols))

Additionally, there would probably need to be another check (if there isn't already?) to handle the behavior for an empty list. Should this be assumed to mean the same thing as None? It doesn't make sense to read a sheet and not return any data.

If we take this a little further, we could add support for mixed lists of integers and strings (if desired) by doing something like:

def _list2cols(area_list):
    parsed_list = list()
    for e in area_list:
        if isinstance(e, int):
            parsed_list.append(e)
        elif isinstance(e, compat.string_types):
            parsed_list.extend(_range2cols(e))
        else:
            pass # Assuming other types should not be considered

    return parsed_list

if isinstance(usecols, list):
    return i in _list2cols(usecols)
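
For anyone who wants to try the mixed-list idea in isolation, here is a self-contained sketch. It assumes simplified stand-ins for the pandas helpers (`excel2num`, `range2cols`, and `list2cols` are names made up for this example) and uses plain `str` in place of `compat.string_types`.

```python
def excel2num(name):
    """Convert an Excel column name such as 'A' or 'AB' to a 0-based index."""
    index = 0
    for char in name.strip().upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def range2cols(areas):
    """Expand a spec such as 'A,C:E' into a list of 0-based column indices."""
    cols = []
    for area in areas.split(","):
        if ":" in area:
            start, end = area.split(":")
            cols.extend(range(excel2num(start), excel2num(end) + 1))
        else:
            cols.append(excel2num(area))
    return cols


def list2cols(area_list):
    """Resolve a mixed list of integers and Excel column specs to indices."""
    parsed = []
    for item in area_list:
        if isinstance(item, int):
            parsed.append(item)
        elif isinstance(item, str):
            parsed.extend(range2cols(item))
        # Any other type is ignored, as in the proposal above.
    return parsed


print(list2cols([0, "C:D", "F"]))  # [0, 2, 3, 5]
print(list2cols([]))               # []: the empty-list case still needs a decision
```

Note that this treats every string as an Excel column letter, not as a header label, so a header named "B" sitting in column A would not be matched by usecols=['B'].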

Interested in feedback on which direction should be taken.

chris-b1 · 2017-11-20T16:59:55Z

What I think would be easiest here would be to only have _should_parse handle the case when usecols is an Excel column specification (e.g. 'A,B,D:E'), and in all other cases pass through usecols to TextParser here.

pandas/pandas/io/excel.py

Line 529 in 1915ffc

parser = TextParser(data, header=header, index_col=index_col,

It already has logic to handle column names/locations, and will raise in the mixed case.

from pandas.io.parsers import TextParser
TextParser([['a', 'b', 'c'],
            [1, 2, 3]], usecols=['b']).read()

Out[81]: 
   b
0  2

jhall6226 · 2017-11-20T18:52:40Z

Ok. So in that case, you would remove the preprocessing steps here.

pandas/pandas/io/excel.py

Line 478 in 1915ffc

if usecols is not None and j not in should_parse:

Then, you would call a function like the current _should_parse to convert the usecols value into something readable by TextParser and pass that new value where you referenced. The conversion would only have to be done once, rather than once per cell (rows x columns times) as it is currently done.

Is that the correct interpretation?
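
To illustrate the "resolve once" point, here is a rough sketch under simple assumptions: the sheet is already available as a list of rows, and the wanted column indices have already been computed. `filter_columns` is a hypothetical helper for this example, not part of pandas.

```python
def filter_columns(rows, wanted_indices):
    """Keep only the wanted columns, resolving the selection a single time."""
    wanted = frozenset(wanted_indices)  # built once, reused for every cell
    return [
        [cell for j, cell in enumerate(row) if j in wanted]
        for row in rows
    ]


rows = [["A", "B", "C"], [1, 2, 3], [4, 5, 6]]
print(filter_columns(rows, [1]))  # [['B'], [2], [5]]
```

The membership test still runs per cell, but the usecols value itself is interpreted only once, before any row is visited.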

chris-b1 · 2017-11-20T21:27:00Z

Yes, I'm thinking that will be the most straightforward approach, rather than re-implementing what TextParser does. Feel free to put up a work-in-progress pull request and ping me; it is often easier to answer questions when looking at the actual changes.


JeroenDelcour · 2017-11-30T13:12:12Z

Any progress on this? I couldn't find the mentioned pull request.

Vonatzki · 2018-01-09T09:11:38Z

Wow, after 2 years of using this, I stumbled upon this after I updated my pandas.

This is scary since most of my scripts use the usecols parameter heavily. I guess I have to downgrade for now.

kuraga · 2018-01-14T20:43:37Z

Yeah, the same issue with 0.21.0 and 0.22.0. No issues with 0.20.3.

All of my column names are complex Cyrillic strings (not A, B, etc.).

linlinzhao · 2018-01-15T09:51:03Z

I have a set of Excel files that all share a few columns I need to read, while their other columns differ. The old usecols with specified column names worked nicely.

Do I have to downgrade Pandas? Currently I have 0.22.0

ldacey · 2018-01-24T19:51:16Z

Same issue here. usecols is returning an empty dataframe after upgrading.

weifei0228 · 2018-02-01T07:02:15Z

Same issue here. usecols is returning an empty dataframe with 0.22.0.

Bravico · 2018-03-19T02:36:00Z

Same issue: an empty dataframe is returned with 0.22.0.

LISHITING · 2018-03-23T04:53:44Z

This worked for me: instead of typing usecols=['B'], try usecols='B'.

jacksonjos · 2018-03-25T00:13:25Z

I'm going to work on this issue to solve it.

If anyone is already working on it, please, tell me.

- closes pandas-dev#18273
- tests added/passed
- passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- whatsnew entry

As mentioned, read_excel returns an empty DataFrame when the usecols argument is a list of strings. Now lists of strings are correctly interpreted by the read_excel function.

capability of passing column labels for columns to be read

- [x] closes pandas-dev#18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel', which receives a string containing comma-separated Excel ranges and columns. Changed the 'usecols' named argument of the 'read_excel' function: it now receives a list of strings containing column labels, a list of integers representing column indexes, or a callable. Created and altered tests to reflect the new usage of these named arguments.

The 'index_col' keyword used to indicate which columns in the subset selected by 'usecols' or 'usecols_excel' should be the index of the DataFrame read. Now 'index_col' indicates which columns of the DataFrame will be the index even if that column is not in the subset of the selected columns.

jacksonjos · 2018-06-06T11:31:29Z

@chris-b1, could you take a look at #20480, please?

@jreback requested your review there again, and it can't be closed without you.
Thank you for your attention.


Teckloon · 2018-07-03T02:30:37Z

Thanks for resolving the read_excel empty dataframe issue.

gfyoung · 2018-10-28T23:15:32Z

Throwing my two cents in:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Supporting lists of strings is not technically addressed in the documentation, so I'm a little hesitant to call this a bug as of the current version of pandas (0.23.4).

That being said, this issue does bring up a lot of questions re: how to handle usecols for read_excel, in particular, why its handling is so different from usecols for CSV:

Column ranges (e.g. A:C) - that's totally fine with me. That's special to Excel.
"If int then indicates last column to be parsed" - We don't support this for CSV. I don't see why we support this for Excel?
List-like support for Excel is pretty bad. We don't support values like usecols=[0, 1, 2] or usecols=['A', 'B', 'C'], which I think we should.
We don't support callables for usecols, though I don't see why we shouldn't.

What do you guys think?
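
For comparison, this is the read_csv behavior being referred to, as a small sketch that assumes pandas is installed and uses an in-memory CSV invented for the example. Labels, positions, and callables are all accepted by usecols there.

```python
import io

import pandas as pd

csv_text = "A,B,C\n1,2,3\n4,5,6\n"

# Select column B by label, by position, and with a callable on the name.
by_label = pd.read_csv(io.StringIO(csv_text), usecols=["B"])
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1])
by_callable = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name == "B")

for frame in (by_label, by_position, by_callable):
    print(list(frame.columns), frame["B"].tolist())
```

Each of the three calls yields a single column B with the values 2 and 5, which is the consistency being asked for in read_excel.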

WillAyd · 2018-10-28T23:22:41Z

My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range).

It would require some deprecations from the current state but I think that logical separation would clarify any ambiguity between range "A:A" in Excel and a column named "A".

gfyoung · 2018-10-28T23:27:20Z

> My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range).

@WillAyd : Not sure I fully agree with adding a new parameter, but at least we have consensus that usecols should be widely consistent across the board. 👍

@jreback @chris-b1 : Thoughts?

The idea is that we read the Excel file, get the data, and then let the TextParser handle the reading and parsing. We shouldn't be doing a lot of work that is already defined in parsers.py.

In doing so, we identified several bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with that of read_csv for list of strings and callable inputs
* usecols was not being validated as proper Excel column names when passed as a string

Closes gh-18273. Closes gh-20480.

shm007g · 2019-02-01T03:47:34Z

The problem still exists in pandas==0.23.4.
I just want to read specific columns of an Excel file by column name; I don't know which column positions they have in Excel.

jreback · 2019-02-01T05:00:06Z

this was closed in 0.24.0


chris-b1 added the Bug, Difficulty Novice, and IO Excel (read_excel, to_excel) labels Nov 14, 2017

chris-b1 added this to the Next Major Release milestone Nov 14, 2017

jreback added good first issue and removed good first issue Difficulty Novice labels Dec 15, 2017

jorisvandenbossche added the Regression (functionality that used to work in a prior pandas version) label Feb 1, 2018

jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Feb 1, 2018

jreback modified the milestones: 0.23.0, Next Major Release Feb 1, 2018

jacksonjos mentioned this issue Mar 25, 2018

BUG: read_excel return empty dataframe when using usecols #20480

Closed

4 tasks

jreback removed this from the Next Major Release milestone Mar 25, 2018

gfyoung mentioned this issue Nov 7, 2018

BUG: Delegate more of Excel parsing to CSV #23544

Merged

jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 8, 2018

jreback closed this as completed in #23544 Nov 11, 2018
