Output excel table objects with to_xlsx() #24862

tdamsma · 2019-01-21T15:55:36Z

Currently pandas can quickly write a dataframe to an excel sheet. However the output is a plain workbook with a bunch of values, and not the much more powerful excel table object. Excel table objects are very useful because they allow referencing columns/cells by header name instead of $A$23, better filtering, sorting, formatting, pivoting, plotting etc. Of course this can be achieved by selecting the cells output by Pandas and use the Format as Table functionality, but why not support this out of the box?

With XlsxWriter this functionality is fully supported , so it should not be too hard to implement. I would be willing to make a PR if there is any interest, though I might need some guidance as I am not familiar with the Pandas code base.

The text was updated successfully, but these errors were encountered:

chris-b1 · 2019-01-21T16:11:50Z

I think this would be a reasonable enhancement - probably via some kind of parameter to_excel(..., format='cells|table') ?

I'd start around here for the existing Excel logic. You can ping on here with questions or also feel free to put up a WIP PR.

pandas/pandas/io/formats/excel.py

Line 631 in 6d918f0

def write(self, writer, sheet_name='Sheet1', startrow=0,

There are probably some annoying corner cases with MultiIndex that don't fit into the Table format - can probably raise on those.

tdamsma · 2019-01-21T21:07:30Z

I am trying to bolt on this functionality as per @chris-b1 suggestion with an extra keyword to the already very heave to_excel(... function. Problem is that I can't just apply some extra formatting somewhere deep down in the writer function, as the table constructor needs to know if the header should be included, and if so what the column names are etc. So there is a bit of a mismatch with the current implementation. Also, a lot of the functionality to merge cells (and perhaps even to apply custom formatting on a per cell basis) would be a bit superfluous. ow about a separate method to_excel_table(... which could have a separate set of options from the to_excel function?

For an idea of the options we could support, here is a snippet for OpenPyXL (full example here:

# add column headings. NB. these must be strings
ws.append(["Fruit", "2011", "2012", "2013", "2014"])
for row in data:
    ws.append(row)

tab = Table(displayName="Table1", ref="A1:E5")

# Add a default style with striped rows and banded columns
style = TableStyleInfo(name="TableStyleMedium9", showFirstColumn=False,
                       showLastColumn=False, showRowStripes=True, showColumnStripes=True)

And from XlsxWriter

worksheet.add_table('B3:F7', {'data': data,
                              'style': 'Table Style Light 11'})

XlsxWriter supports the following keywords:

autofilter
header_row
banded_columns
banded_rows
first_column
last_column
style
total_row
columns
name

WillAyd · 2019-01-21T21:18:45Z

Without having looked to deeply it feels like toggling this via a keyword argument would still be the better way to go, as a separate function would I think end up duplicating a lot of functionality. With that said, if you see a better way of going about it always open to ideas - probably just best to push a PR and have it reviewed in that case

jreback · 2019-01-21T21:34:17Z

going to be -1 on a to_excel_table

this certainly should be done via a keyword

tdamsma · 2019-01-22T07:26:31Z

Clear, I'll give it a go and see where it leads

tdamsma · 2019-01-24T10:04:13Z

@chris-b1, I put up the WIP PR.

For discussion, which part of the current to_excel api should be supported? This is my suggestion

excel_writer ✔️
sheet_name ✔️
na_rep ✔️
float_format ✔️
columns ✔️
header ✔️
index ✔️(Every column needs a name though, if the header is included. I currently implemented defaults similar to what Excel does: Column0, Column1)
index_label ✔️
startrow ✔️ (Though it I think it is bad practice to put more than one table on a sheet, staring at cell A1, and would not mind discouraging any other use)
startcol ✔️
engine Only OpenPyXL and XlsxWriter
merge_cells Can not be supported
encoding ✔️️
inf_rep ✔️
verbose ✔️
freeze_panes Am not sure if the really makes sense, as the header rows is automatically semi frozen (the column names A, B etc are replaced with the header names) when a table is used and the header is scrolled out of the window.

Proposed:

as_table I like this better than a format keyword
table_name Think this would be desirable to add

And then there are many other other options that could be supported. think it would be better to leave that out though:

autofilter
banded_columns
banded_rows
first_column
last_column
style
total_row

tdamsma · 2019-01-30T10:06:35Z

@WillAyd, @jreback, @chris-b1 To effectively test the proposed as_table functionality, there needs to be a function the reads an excel table back into a dataframe, see also discussion of #24899. I came up with the following, which can read all tables in a workbook (regardless of which workbook they are in) into separate dataframes.

from openpyxl import load_workbook
from pandas.core.dtypes.common import is_list_like
from pandas.core.frame import DataFrame
def read_excel_tables(io, table_name=None, index_col='auto'):
    """Read an Excel table into a pandas dataframe.
    
    Only supports xlsx files.
    
    Parameters
    ----------
    io : str, file descriptor or pathlib.Path
    table_name : str or None, default None
        Strings are used for table names. Specify None to get all 
        tables in a dict of dataframes. 
    index_col: int, list of int, None or 'auto', Default 'auto'
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``.
        'auto' will determine if there is an index column from the table
        `First Column` option in Excel
    """
    def get_tables(io):
        # unfortunately tables are only parsed in the slower write mode
        wbk = load_workbook(io, data_only=True, read_only=False)
        tables = {}
        for wks in wbk:
            for t in wks._tables:
                tables[t.name] = dict(table=t, wks=wks)
        return tables

    def read_table(table, wks, index_col):
        columns = [col.name for col in table.tableColumns]
        data_rows = wks[table.ref][
            (table.headerRowCount or 0):
            -table.totalsRowCount if table.totalsRowCount is not None else None]
        data = [[cell.value for cell in row] for row in data_rows]
        frame = DataFrame(data, columns=columns, index=None)
        if index_col:
            if index_col == 'auto':
                if table.tableStyleInfo.showFirstColumn:
                    frame.set_index(columns[0])
            elif is_list_like(index_col):
                frame = frame.set_index([columns[i] for i in index_col])
            else:
                frame = frame.set_index(columns[index_col])
        return frame

    tables = get_tables(io)
    if table_name is not None:
        return read_table(**tables[table_name], index_col=index_col)
    else:
        return {k: read_table(**v, index_col=index_col) for k, v in tables.items()}

I think it would make perfect sense to also include this in pandas, however not really sure where to place that code. There is currently only an xlrd implementation for and excel reader, but xlrd does not support reading tables.

So then I have the following questions:

Would it make sense to implement a read table functionality first before a write table option so it is easier to test?
Should that be a separate PR?
Would an OpenPyXL reader need to support the full (and extensive) read_excel api?
Should reading an excel table be a keyword to read_excel? As an Excel table referenced by name already implies a sheet name, if there is a header, column names, index etc.

WillAyd · 2019-01-31T04:47:34Z

Would it make sense to implement a read table functionality first before a write table option so it is easier to test?

Makes sense to me!

Should that be a separate PR?

Yep

Would an OpenPyXL reader need to support the full (and extensive) read_excel api?

I would say yes; I don't think it makes sense to implement an engine for a small subset of functionality as it just makes our API more confusing. Note that there has been a decent amount of refactoring going on to better support community engagement on this. There's also an open issue #11499 for this.

I would think if anything this is the most logical starting point to just get the reader up and working. From there you could add table support (I think better as a keyword argument in read_excel instead of a separate function) and from there could start working on writing to get the entire roundtripping. Again just my $.02 and for sure easier said than done, but I would welcome any PRs towards that effort

Should reading an excel table be a keyword to read_excel? As an Excel table referenced by name already implies a sheet name, if there is a header, column names, index etc.

Yea I think a keyword argument makes the most sense as it keeps the API simplest and you could leverage the existing functionality of other applicable parameters

tdamsma · 2019-01-31T20:58:38Z

Didn't know that supporting more readers was an ongoing development. Are you still working on that? Also, just out of curiosity, what is the rationale for supporting multiple excel libraries?

WillAyd · 2019-01-31T21:04:53Z

I haven't actually put any code into it, just be reorganizing things in hopes of better community engagement.

As far as supporting multiple libraries goes the libraries support different file types. Openpyxl I believe is only .xlsx format, xlrd can handle .xls in addition to .xlsx. Off the top of my head I don't think either support .xlsb and not sure about .xlsm, so having multiple engines gives flexibility to seamlessly deal with different file types amongst potential other optimizations available to the end user

tdamsma · 2019-02-01T07:03:36Z

I'll give it a go then to make an OpenPyXL reader. I see there are pretty extensive tests for read_excel and I can just add an extra engine param, so hopefully that should be pretty straightforward.

# Conflicts: # pandas/core/generic.py # pandas/io/excel.py # pandas/io/formats/excel.py # pandas/tests/io/test_excel.py

WillAyd · 2019-07-09T15:41:41Z

I'm wondering if just having a table= keyword in to_excel would work here? The advantage of that (at least in the openpyxl world) is that we could let third party libraries take care of table formatting (like via TableStyleInfo) and get rid of the ExcelFormatter stuff we have somewhat half-baked into pandas

cc @Themanwithoutaplan for thoughts

Themanwithoutaplan · 2019-07-09T18:40:39Z

I'm wondering if just having a table= keyword in to_excel would work here? The advantage of that (at least in the openpyxl world) is that we could let third party libraries take care of table formatting (like via TableStyleInfo) and get rid of the ExcelFormatter stuff we have somewhat half-baked into pandas

cc @Themanwithoutaplan for thoughts
Would it need anything else from openpyxl? In general I find the signatures of these handy-but-kitchen-sink functions already very extensive and, as long as you're not in write-only mode, the workbook can be changed subsequently by the user, which I think is preferable.

tdamsma · 2019-07-09T19:33:26Z

I just merged the master to get a feel again for where I left off. Seems to work, but still a lot of loose ends to be discussed. @WillAyd, perhaps you could reopen #24899? Or maybe keep the discussion here for the moment. In the meantime I'll try to see if I can get a table keyword in the read_excel function. Should reading and writing tables be the same pr?

tdamsma · 2021-10-19T15:53:55Z

I bit the bullet and created a dedicated package to read and write properly formatted excel tables with pandas: https://github.com/VanOord/pandas-xlsx-tables

mroeschke · 2023-11-21T17:32:50Z

Appears there hasn't been much activity or community support for this feature in a while so closing. Happy to reopen if there's renewed support

tdamsma changed the title ~~Output formatted excel table with to_xlsx()~~ Output excel table objects with to_xlsx() Jan 21, 2019

chris-b1 added the IO Excel read_excel, to_excel label Jan 21, 2019

gfyoung added the Enhancement label Jan 21, 2019

WillAyd added this to the Contributions Welcome milestone Jan 21, 2019

tdamsma added a commit to tdamsma/pandas that referenced this issue Jan 24, 2019

ENH: implement excel table (pandas-dev#24862)

32f10e5

tdamsma mentioned this issue Jan 24, 2019

[WIP] Excel table output #24899

Closed

4 tasks

tdamsma added a commit to tdamsma/pandas that referenced this issue Jan 27, 2019

Merge branch 'master' into excel-tables-pandas-dev#24862

5f9d664

tdamsma mentioned this issue Feb 7, 2019

Openpyxl engine for reading excel files #25092

Merged

5 tasks

tdamsma added a commit to tdamsma/pandas that referenced this issue Jul 9, 2019

Merge branch 'master' into excel-tables-pandas-dev#24862

9636999

# Conflicts: # pandas/core/generic.py # pandas/io/excel.py # pandas/io/formats/excel.py # pandas/tests/io/test_excel.py

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

mroeschke closed this as completed Nov 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Output excel table objects with to_xlsx() #24862

Output excel table objects with to_xlsx() #24862

tdamsma commented Jan 21, 2019 •

edited

Loading

chris-b1 commented Jan 21, 2019

tdamsma commented Jan 21, 2019

WillAyd commented Jan 21, 2019

jreback commented Jan 21, 2019

tdamsma commented Jan 22, 2019

tdamsma commented Jan 24, 2019 •

edited

Loading

tdamsma commented Jan 30, 2019

WillAyd commented Jan 31, 2019

tdamsma commented Jan 31, 2019

WillAyd commented Jan 31, 2019

tdamsma commented Feb 1, 2019

WillAyd commented Jul 9, 2019

Themanwithoutaplan commented Jul 9, 2019

tdamsma commented Jul 9, 2019 •

edited

Loading

tdamsma commented Oct 19, 2021

mroeschke commented Nov 21, 2023

Output excel table objects with to_xlsx() #24862

Output excel table objects with to_xlsx() #24862

Comments

tdamsma commented Jan 21, 2019 • edited Loading

chris-b1 commented Jan 21, 2019

tdamsma commented Jan 21, 2019

WillAyd commented Jan 21, 2019

jreback commented Jan 21, 2019

tdamsma commented Jan 22, 2019

tdamsma commented Jan 24, 2019 • edited Loading

tdamsma commented Jan 30, 2019

WillAyd commented Jan 31, 2019

tdamsma commented Jan 31, 2019

WillAyd commented Jan 31, 2019

tdamsma commented Feb 1, 2019

WillAyd commented Jul 9, 2019

Themanwithoutaplan commented Jul 9, 2019

tdamsma commented Jul 9, 2019 • edited Loading

tdamsma commented Oct 19, 2021

mroeschke commented Nov 21, 2023

tdamsma commented Jan 21, 2019 •

edited

Loading

tdamsma commented Jan 24, 2019 •

edited

Loading

tdamsma commented Jul 9, 2019 •

edited

Loading