Openpyxl engine for reading excel files #25092

Merged 87 commits on Jun 28, 2019
Changes from 83 commits
Commits
e29b4c0
prepare testing reading excel files with multiple engines
tdamsma Feb 2, 2019
e0199a8
add openpyxl tests
tdamsma Feb 2, 2019
ce4eb01
implement first version of openpyxl reader
tdamsma Feb 2, 2019
b25877e
pep8 issues
tdamsma Feb 2, 2019
821fa4d
suppress openpyxl warnings
tdamsma Feb 2, 2019
4694668
add code for all edge cases that are tested for. Unfortunately got pr…
tdamsma Feb 7, 2019
712f1ef
formatting
tdamsma Feb 7, 2019
1d49a0e
Merge commit '683c7b55f5195fdf4f524239066cbf6f1301f0e7' into openpyxl…
tdamsma Feb 7, 2019
1473c0e
improve docstring
tdamsma Feb 7, 2019
6e8ffba
also test openpyxl reader for .xlsm files
tdamsma Feb 7, 2019
d57dfc1
explicitly use 64bit floats and ints
tdamsma Feb 7, 2019
e984f6b
Merge commit '6359bbc4c9ce6dd05bc8b422641cda74871cde43' into openpyxl…
tdamsma Feb 11, 2019
44f7af2
formatting
tdamsma Feb 11, 2019
98d3865
skip TestOpenpyxlReader when openpyxl is not installed
tdamsma Feb 11, 2019
d0188ba
Attempt to generalize _XlrdReader __init__ and move it to _BaseExcelR…
tdamsma Feb 12, 2019
205d52b
Merge commit 'f4568fd76e864d8aee3d23f5a81302262d6e0dcb' into openpyxl…
tdamsma Feb 20, 2019
7b550bf
register openpyxl writer engine, fix imports
tdamsma Feb 26, 2019
875de8d
import type_error explicitly
tdamsma Feb 26, 2019
12ad6d8
Merge branch 'master' into openpyxl-reader
tdamsma Mar 11, 2019
dfd6a36
Merge branch 'master' into openpyxl-reader
tdamsma Mar 19, 2019
fef7233
Merge branch 'master' into openpyxl-reader
tdamsma Apr 20, 2019
eaafd5f
get rid of some py2 compatibility legacy
tdamsma Apr 21, 2019
8d2db02
Merge branch 'master' into openpyxl-reader
tdamsma Apr 22, 2019
13e7793
fix some type chcking
tdamsma Apr 22, 2019
b053cce
linting
tdamsma Apr 22, 2019
fe4dd73
see if this works on linux
tdamsma Apr 22, 2019
64e5f2d
run isort on _openpyxl.py
tdamsma Apr 22, 2019
99b2cad
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 23, 2019
ce5ac05
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 23, 2019
c7895ea
Merge remote-tracking branch 'pandas/master' into openpyxl-reader
tdamsma Apr 27, 2019
2ca9368
refactor handling of sheet_name keyword
tdamsma Apr 27, 2019
5fb1aef
extract code to parse a single sheet to a method
tdamsma Apr 27, 2019
537dd0c
extract handling of header keywords
tdamsma Apr 27, 2019
44cddc5
extract handling of convert_float keyword to method
tdamsma Apr 27, 2019
e4c8f23
extract handling of index_col to method
tdamsma Apr 27, 2019
daff364
extract handling of usecols keyword to method
tdamsma Apr 27, 2019
1224918
remove redundant code
tdamsma Apr 27, 2019
1bfc030
Merge remote-tracking branch 'upstream/master' into excel-read-shared…
tdamsma Apr 28, 2019
747311e
Merge branch 'master' into excel-read-shared-init-to-baseclass
tdamsma Apr 28, 2019
a77a4c7
implement suggestions @WillAyd
tdamsma Apr 29, 2019
ddcaad8
Merge remote-tracking branch 'upstream/master' into excel-read-shared…
tdamsma Apr 29, 2019
757235d
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 29, 2019
cdd627f
remove _engine keyword altogether
tdamsma Apr 29, 2019
0b58109
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 29, 2019
45f21f8
Clean up __init__
tdamsma Apr 29, 2019
e97d029
Implement work around for Linux py35_compat import error
tdamsma Apr 29, 2019
1edae5e
fix regression for reading s3 files
tdamsma Apr 30, 2019
a69e104
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 30, 2019
f5f40e4
expand code highlighting the weirdness of a failing/skipped test.
tdamsma Apr 30, 2019
22e24bb
remove _engine keyword altogether
tdamsma Apr 29, 2019
903b188
fix regression for reading s3 files
tdamsma Apr 30, 2019
1b3ae99
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 30, 2019
02e19a8
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 30, 2019
3e18f97
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 30, 2019
d11956c
remove accidental commit
tdamsma May 1, 2019
61d7a3f
ditch some code
tdamsma May 1, 2019
13d41b2
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Jun 10, 2019
97c85f5
remove skips for openpyxl for tests that should pass
tdamsma Jun 11, 2019
614d972
Add `by_blocks=True` to failing `assert_frame_equal` tests, as per @W…
tdamsma Jun 13, 2019
d87d9c0
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
WillAyd Jun 27, 2019
7348b0c
Updated import machinery
WillAyd Jun 27, 2019
c1a1792
Cleaned up nan replacement
WillAyd Jun 27, 2019
d72ca5a
Simplified introspection
WillAyd Jun 27, 2019
0bba345
Used common renaming method
WillAyd Jun 27, 2019
8dd8bf6
Reverted some test changes
WillAyd Jun 27, 2019
eaaa680
Reset yield statement
WillAyd Jun 27, 2019
6bf5183
Better missing label handling
WillAyd Jun 27, 2019
a06bf9b
Aligned implementation with base
WillAyd Jun 27, 2019
f43e90f
Fix bool handling
WillAyd Jun 27, 2019
8fabe0a
Fixed 0 handling
WillAyd Jun 27, 2019
0ff5ce3
Aligned float handling with xlrd
WillAyd Jun 27, 2019
fb73692
xfailed overflow test
WillAyd Jun 27, 2019
17b1d73
lint and isort fixup
WillAyd Jun 27, 2019
3d248ed
Removed by_blocks
WillAyd Jun 27, 2019
c369fd8
Revert "Reverted some test changes"
tdamsma Jun 28, 2019
70b15a4
use readonly mode. Should be more performant and also this ignores Me…
tdamsma Jun 28, 2019
a3a3bca
formatting issues
tdamsma Jun 28, 2019
fcd43f0
handle datetime cells explicitly for openpyxl < 2.5.0 compatibility
tdamsma Jun 28, 2019
d9c1fa6
type fixup
WillAyd Jun 28, 2019
3c239a4
whatsnew
WillAyd Jun 28, 2019
4a25a5a
Removed np.nan from Scalar
WillAyd Jun 28, 2019
6258e59
revert test_reader changes again. Not needed anymore because of using…
tdamsma Jun 28, 2019
00f34b1
more types and whitespace cleanup
WillAyd Jun 28, 2019
a1fba90
Added config for excel reader. Not sure how to test this
tdamsma Jun 28, 2019
88ee325
whatsnew
WillAyd Jun 28, 2019
837ce26
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
WillAyd Jun 28, 2019
dddc8c5
Regenerated test1 files
WillAyd Jun 28, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -133,6 +133,7 @@ Other Enhancements
 - :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`)
 - Added support for reading SPSS .sav files using :func:`read_spss` (:issue:`26537`)
 - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
+- :func:`read_excel` can now use openpyxl to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
Review comment (Contributor): I would put ``openpyxl`` in double backquotes, but if this is the only issue, it can be fixed later.


.. _whatsnew_0250.api_breaking:
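
A minimal sketch of what the whatsnew entry above describes: a small workbook is written in memory and read back through `read_excel` with `engine='openpyxl'`. It assumes pandas 0.25 or later and openpyxl are installed; the helper name `roundtrip_with_openpyxl` and the sample data are illustrative and not part of this PR.

```python
import io


def roundtrip_with_openpyxl():
    """Write a tiny workbook in memory, then read it back via the openpyxl engine."""
    # Deferred imports: both packages are optional third-party dependencies.
    import pandas as pd
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(["a", "b"])  # header row
    ws.append([1, "x"])
    ws.append([2, "y"])

    buf = io.BytesIO()
    wb.save(buf)  # openpyxl can save to a file-like object
    buf.seek(0)

    # engine='openpyxl' selects the reader registered by this PR.
    return pd.read_excel(buf, engine="openpyxl")


if __name__ == "__main__":
    print(roundtrip_with_openpyxl())
```

Omitting `engine` keeps whatever default the installed pandas version uses, so passing it explicitly is the only way to be sure which reader runs.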

1 change: 1 addition & 0 deletions pandas/_typing.py
@@ -24,3 +24,4 @@
 FilePathOrBuffer = Union[str, Path, IO[AnyStr]]

 FrameOrSeries = TypeVar('FrameOrSeries', ABCSeries, ABCDataFrame)
+Scalar = Union[str, int, float]
4 changes: 3 additions & 1 deletion pandas/io/excel/_base.py
@@ -422,7 +422,7 @@ def parse(self,
             data = self.get_sheet_data(sheet, convert_float)
             usecols = _maybe_convert_usecols(usecols)

-            if sheet.nrows == 0:
+            if not data:
                 output[asheetname] = DataFrame()
                 continue
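
The move from `sheet.nrows == 0` to `not data` is presumably needed because `nrows` is an xlrd sheet attribute, whereas `data` is the list of rows that `get_sheet_data` returns for any engine. A small sketch of that idea, using hypothetical stand-in sheet classes rather than the real xlrd or openpyxl types:

```python
from typing import List


class XlrdLikeSheet:
    """Hypothetical stand-in for an xlrd sheet, which exposes ``nrows``."""

    def __init__(self, rows: List[list]) -> None:
        self._rows = rows
        self.nrows = len(rows)


class OpenpyxlLikeSheet:
    """Hypothetical stand-in for an openpyxl worksheet (no ``nrows`` attribute)."""

    def __init__(self, rows: List[list]) -> None:
        self.rows = rows


def get_sheet_data(sheet) -> List[list]:
    """Return the parsed rows as a list of lists, whatever the sheet type."""
    if hasattr(sheet, "nrows"):
        return list(sheet._rows)
    return [list(row) for row in sheet.rows]


def sheet_is_empty(sheet) -> bool:
    # Mirrors the new check: an empty list of rows is falsy.
    return not get_sheet_data(sheet)
```

Checking the parsed rows keeps the base class free of engine-specific attributes.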

@@ -769,9 +769,11 @@ class ExcelFile:
     """

     from pandas.io.excel._xlrd import _XlrdReader
+    from pandas.io.excel._openpyxl import _OpenpyxlReader

     _engines = {
         'xlrd': _XlrdReader,
+        'openpyxl': _OpenpyxlReader,
     }

     def __init__(self, io, engine=None):
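
The `_engines` mapping above is a plain name-to-class registry. The sketch below shows how such a registry can be used to pick a reader. The stub classes, the `make_reader` helper and the error message are hypothetical. Defaulting to `'xlrd'` when no engine is given is an assumption, based on the whatsnew note that openpyxl will only become the default in a future release.

```python
class XlrdReaderStub:
    """Hypothetical stand-in for _XlrdReader."""

    def __init__(self, io):
        self.io = io


class OpenpyxlReaderStub:
    """Hypothetical stand-in for _OpenpyxlReader."""

    def __init__(self, io):
        self.io = io


ENGINES = {
    "xlrd": XlrdReaderStub,
    "openpyxl": OpenpyxlReaderStub,
}


def make_reader(io, engine=None):
    """Look up the reader class by engine name and instantiate it."""
    if engine is None:
        engine = "xlrd"  # assumed default at the time of this PR
    if engine not in ENGINES:
        raise ValueError("Unknown engine: {engine}".format(engine=engine))
    return ENGINES[engine](io)
```

With this shape, adding an engine is a one-line change to the mapping, which is all the diff above needs to do.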
74 changes: 73 additions & 1 deletion pandas/io/excel/_openpyxl.py
@@ -1,4 +1,12 @@
-from pandas.io.excel._base import ExcelWriter
+from typing import List
+
+import numpy as np
+
+from pandas.compat._optional import import_optional_dependency
+
+from pandas._typing import FilePathOrBuffer, Scalar
+
+from pandas.io.excel._base import ExcelWriter, _BaseExcelReader
 from pandas.io.excel._util import _validate_freeze_panes


@@ -451,3 +459,67 @@ def write_cells(self, cells, sheet_name=None, startrow=0, startcol=0,
                             xcell = wks.cell(column=col, row=row)
                             for k, v in style_kwargs.items():
                                 setattr(xcell, k, v)
+
+
+class _OpenpyxlReader(_BaseExcelReader):
+
+    def __init__(self, filepath_or_buffer: FilePathOrBuffer) -> None:
+        """Reader using openpyxl engine.
+
+        Parameters
+        ----------
+        filepath_or_buffer : string, path object or Workbook
+            Object to be parsed.
+        """
+        import_optional_dependency("openpyxl")
+        super().__init__(filepath_or_buffer)
+
+    @property
+    def _workbook_class(self):
+        from openpyxl import Workbook
+        return Workbook
+
+    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
+        from openpyxl import load_workbook
+        return load_workbook(filepath_or_buffer,
+                             read_only=True, data_only=True)
Review comment (Contributor): FWIW, you almost certainly want `keep_links=False` in there as well. On files that include data from other workbooks, Excel creates caches of the relevant worksheets, and openpyxl preserves them by default. These caches can be pretty big, are read into memory, and are almost certainly irrelevant for `pandas.read_excel()`. For an example, see openpyxl bug #494.
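
A sketch of what the reviewer's suggestion could look like. This is not the code merged in this PR. It assumes an openpyxl version whose `load_workbook` accepts `keep_links`, and the names `load_readonly_without_links` and `demo` are illustrative.

```python
import io


def load_readonly_without_links(filepath_or_buffer):
    """Open a workbook for reading values only, without external-link caches."""
    # Deferred import, as in the reader above: openpyxl is an optional dependency.
    from openpyxl import load_workbook

    return load_workbook(
        filepath_or_buffer,
        read_only=True,    # stream cells instead of building the full workbook
        data_only=True,    # return cached formula results, not formula strings
        keep_links=False,  # reviewer's suggestion: drop links to external workbooks
    )


def demo():
    """Round-trip a one-row workbook through the loader above."""
    from openpyxl import Workbook

    wb = Workbook()
    wb.active.append(["a", 1])
    buf = io.BytesIO()
    wb.save(buf)
    buf.seek(0)

    book = load_readonly_without_links(buf)
    try:
        return [[cell.value for cell in row] for row in book.active.rows]
    finally:
        book.close()  # read-only workbooks keep the file handle open
```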

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


+    @property
+    def sheet_names(self) -> List[str]:
+        return self.book.sheetnames
+
+    def get_sheet_by_name(self, name: str):
+        return self.book[name]
+
+    def get_sheet_by_index(self, index: int):
+        return self.book.worksheets[index]
+
+    def _convert_cell(self, cell, convert_float: bool) -> Scalar:
+
+        # TODO: replace with openpyxl constants
+        if cell.is_date:
+            return cell.value
+        elif cell.data_type == 'e':
+            return np.nan
+        elif cell.data_type == 'b':
+            return bool(cell.value)
+        elif cell.value is None:
+            return ''  # compat with xlrd
+        elif cell.data_type == 'n':
+            # GH5394
+            if convert_float:
+                val = int(cell.value)
+                if val == cell.value:
+                    return val
+            else:
+                return float(cell.value)
+
+        return cell.value
+
+    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
+        data = []  # type: List[List[Scalar]]
+        for row in sheet.rows:
+            data.append(
+                [self._convert_cell(cell, convert_float) for cell in row])
+
+        return data
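
To make the branches of `_convert_cell` easier to follow, here is a standalone copy of its logic that runs without pandas or openpyxl. It assumes the `else` branch belongs to `if convert_float`, substitutes `math.nan` for `np.nan`, and uses a hypothetical `fake_cell` helper in place of real openpyxl cells.

```python
import math
from types import SimpleNamespace


def convert_cell(cell, convert_float: bool):
    """Standalone copy of the conversion rules, with math.nan for np.nan."""
    if cell.is_date:
        return cell.value
    elif cell.data_type == 'e':   # error cell
        return math.nan
    elif cell.data_type == 'b':   # boolean cell
        return bool(cell.value)
    elif cell.value is None:
        return ''                 # compat with xlrd
    elif cell.data_type == 'n':   # numeric cell
        if convert_float:
            val = int(cell.value)
            if val == cell.value:
                return val        # integral value is returned as int
        else:
            return float(cell.value)
    return cell.value


def fake_cell(value, data_type, is_date=False):
    """Hypothetical stand-in exposing only the attributes the converter reads."""
    return SimpleNamespace(value=value, data_type=data_type, is_date=is_date)


if __name__ == "__main__":
    for cell in (fake_cell(1.0, 'n'), fake_cell(1.5, 'n'), fake_cell(None, 'n'),
                 fake_cell(1, 'b'), fake_cell('#DIV/0!', 'e')):
        print(cell.value, '->', repr(convert_cell(cell, convert_float=True)))
```

With `convert_float=True`, a non-integral number falls through to the final `return cell.value`, so it comes back unchanged.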
11 changes: 10 additions & 1 deletion pandas/tests/io/excel/test_readers.py
@@ -38,13 +38,17 @@ class TestReaders:
         # Add any engines to test here
         pytest.param('xlrd', marks=pytest.mark.skipif(
             not td.safe_import("xlrd"), reason="no xlrd")),
+        pytest.param('openpyxl', marks=pytest.mark.skipif(
+            not td.safe_import("openpyxl"), reason="no openpyxl")),
         pytest.param(None, marks=pytest.mark.skipif(
             not td.safe_import("xlrd"), reason="no xlrd")),
     ])
-    def cd_and_set_engine(self, request, datapath, monkeypatch):
+    def cd_and_set_engine(self, request, datapath, monkeypatch, read_ext):
         """
         Change directory and set engine for read_excel calls.
         """
+        if request.param == 'openpyxl' and read_ext == '.xls':
+            pytest.skip()
         func = partial(pd.read_excel, engine=request.param)
         monkeypatch.chdir(datapath("io", "data"))
         monkeypatch.setattr(pd, 'read_excel', func)
@@ -397,6 +401,9 @@ def test_date_conversion_overflow(self, read_ext):
                                  [1e+20, 'Timothy Brown']],
                                 columns=['DateColWithBigInt', 'StringCol'])

+        if pd.read_excel.keywords['engine'] == 'openpyxl':
+            pytest.xfail("Maybe not supported by openpyxl")
+
         result = pd.read_excel('testdateoverflow' + read_ext)
         tm.assert_frame_equal(result, expected)
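
The `pd.read_excel.keywords['engine']` lookup works because the fixture replaces `pd.read_excel` with a `functools.partial`, which exposes its bound keyword arguments as `.keywords`. A stdlib-only sketch of that mechanism follows. The `read_excel` function here is a hypothetical stand-in, and `current_engine` is an illustrative helper that is not in the test suite.

```python
from functools import partial


def read_excel(path, engine=None):
    """Hypothetical stand-in for pd.read_excel; returns the engine it was given."""
    return engine


# What the fixture does, minus monkeypatch: bind the engine once.
patched = partial(read_excel, engine="openpyxl")


def current_engine(func):
    """Return the engine bound on a partial, or None for a plain function."""
    return getattr(func, "keywords", {}).get("engine")
```

Note that the direct `.keywords` access in the test only works while the fixture is active, since the unpatched function has no such attribute.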

@@ -724,6 +731,8 @@ class TestExcelFileRead:
         # Add any engines to test here
         pytest.param('xlrd', marks=pytest.mark.skipif(
             not td.safe_import("xlrd"), reason="no xlrd")),
+        pytest.param('openpyxl', marks=pytest.mark.skipif(
+            not td.safe_import("openpyxl"), reason="no openpyxl")),
         pytest.param(None, marks=pytest.mark.skipif(
             not td.safe_import("xlrd"), reason="no xlrd")),
     ])