Optional indexes #17

shoyer · 2016-09-07T16:47:06Z

The pandas.Index is fantastically useful, but in many cases pandas's insistence on always having an index gets in the way.

Usually, it can be safely ignored when not relevant, especially now that we have RangeIndex (which makes the cost of creating the index minimal), but this is not always the case:

The indexing and join behavior of default RangeIndex is actively harmful. It would be better to raise an error when implicitly joining on an index between two datasets with a default index.
When converting a DataFrame into other formats, we need an argument (e.g., index=True) for controlling whether or not to include the index.
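To make the two pain points above concrete, here is a minimal sketch using current pandas behavior. The variable names and sample values are made up for illustration.

```python
import pandas as pd

# Two datasets that each carry only the default RangeIndex.
left = pd.Series([10, 20, 30, 40])
right = pd.Series([1, 2, 3, 4])

# Filtering keeps the original integer labels 2 and 3 ...
subset = left[left > 20]

# ... so arithmetic aligns on those labels rather than on position,
# and the unmatched labels 0 and 1 silently become NaN.
total = subset + right
print(total)

# Export: the index is written out unless index=False is passed.
frame = pd.DataFrame({"x": [1, 2]})
print(frame.to_csv())             # includes an unnamed index column
print(frame.to_csv(index=False))  # data columns only
```

Nothing here raises an error, which is the problem: the length mismatch between `subset` and `right` is hidden by label alignment.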

I propose that we make the index optional, e.g., by allowing it to be set to None. This entails a need for some rules to handle missing indexes:

Operations that explicitly rely on indexes (e.g., .loc and join) should raise TypeError when called on objects without an index.
Operations that implicitly rely on indexes for alignment (e.g., the DataFrame constructor and arithmetic) now need to handle three cases:
1. Index/index operations: These work as before. The result's index is the outer join of the input indexes.
2. No-index/no-index operations: The inputs must have exactly the same length (otherwise raise TypeError). The result has no index.
3. Mixed index/no-index operations: The inputs must have the same length. The result takes on the index from the input with an index.
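The three cases can be sketched as a small function that computes only the index of the result. `result_index` is a hypothetical helper, not a pandas API; a missing index is represented by `None` as proposed above, and raising `TypeError` in the mixed case is an assumption, since the proposal names the error type only for the no-index/no-index case.

```python
from typing import Optional

import pandas as pd


def result_index(
    left: Optional[pd.Index],
    right: Optional[pd.Index],
    left_len: int,
    right_len: int,
) -> Optional[pd.Index]:
    """Hypothetical helper: index of the result of a binary operation."""
    if left is not None and right is not None:
        # Case 1: both inputs are indexed, so take the outer join.
        return left.union(right)
    # Cases 2 and 3: at least one input has no index, so the
    # inputs can only be matched up by position.
    if left_len != right_len:
        raise TypeError("inputs without an index must have the same length")
    # Case 2 returns None; case 3 returns whichever index exists.
    return left if left is not None else right


print(result_index(pd.Index(["a", "b"]), pd.Index(["b", "c"]), 2, 2))
print(result_index(None, None, 3, 3))
print(result_index(pd.Index([5, 6]), None, 2, 2))
```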

Somewhat related: #15


chris-b1 · 2016-09-07T17:04:26Z

+1, although I think the opposite approach may also be worth consideration.

What if, instead of being a special property of a DataFrame, an "Index" were just defined by a selection of columns in the frame:

0 (your None)
1 (today's Index)
or more (something like a MultiIndex)

This would remove the Index / column distinction, which I think is a stumbling block for many. Some discussion here: pandas-dev/pandas#8162
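Today's pandas can only approximate this idea, by moving columns in and out of the index with `set_index` and `reset_index`. A rough sketch of the zero, one, and several column cases, with made-up column names and data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome"],
    "year": [2015, 2016, 2015],
    "value": [1.0, 2.0, 3.0],
})

no_index = df                              # 0 columns: only the default RangeIndex
one_col = df.set_index("city")             # 1 column: a regular Index
two_cols = df.set_index(["city", "year"])  # 2 or more: a MultiIndex

print(type(one_col.index).__name__)   # Index
print(type(two_cols.index).__name__)  # MultiIndex

# reset_index turns the index levels back into ordinary columns.
restored = two_cols.reset_index()
print(list(restored.columns))
```

Under the proposal in this comment, the index columns would stay visible as regular columns and only be marked as the index, rather than being moved out of the frame's columns.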

That said, it's not clear to me how a Series with an Index fits into this world, and it would be a bigger API change.

shoyer · 2016-09-07T17:51:29Z

@chris-b1 Yes, in fact I almost included "indexes as just a special type of column" as part of this issue, but then decided to save it for another one. Since you brought it up (and it's related), we might as well discuss it here.

I also really like this idea, because the index/column distinction is a major stumbling block for both new and experienced users.

It does entail a major overhaul of the pandas data model, though, which raises a number of questions. In particular: do we still use an Index/MultiIndex for DataFrame.columns? If so, then it follows that column names should now include the names of index columns.

This raises a big issue with how we handle "messy" data (i.e., non-tidy data). Currently, pandas is a pretty capable tool for such datasets, especially with the ability to stack/unstack columns into hierarchies. But if columns is a MultiIndex with multiple levels, adding in index column names is going to make things a mess.

Consider this example adapted from the multi-index docs:

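import numpy as np
import pandas as pd

# Two-level column labels: ('bar', 'one'), ('bar', 'two'), ...
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# Three rows labelled by a named index, six hierarchical columns
df = pd.DataFrame(np.random.randn(3, 6), index=pd.Index(['A', 'B', 'C'], name='letter'), columns=index)
print(df)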

first        bar                 baz                 foo          
second       one       two       one       two       one       two
letter                                                            
A       1.131677  3.008499 -1.513677  0.379074 -0.546790 -2.221491
B      -1.650027  2.157229 -1.030519 -0.187412  0.711109 -0.334537
C       1.226648  0.631318  0.197816  0.494960 -0.435740  1.098061

Do we now need to add in an extra level to the multi-index for the index column name (e.g., letter)? Or do we disallow a MultiIndex for column names altogether and use an index of tuples instead?
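For a sense of what the "extra level" option looks like, current pandas already has to answer a version of this question in `reset_index`: when the columns are a MultiIndex, the index name is padded with an empty string at the lower level. The sketch below shows that existing behavior only; it does not settle what the proposed data model should do.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    np.random.randn(3, 6),
    index=pd.Index(["A", "B", "C"], name="letter"),
    columns=columns,
)

# Moving the index into the columns forces 'letter' to become a
# two-level label, padded with '' at the 'second' level.
flat = df.reset_index()
print(flat.columns[0])       # ('letter', '')
print(flat.columns.nlevels)  # 2
```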

In general, I don't think it's worth a huge amount of effort to make it easy to work with such data, given how much nicer tidy data is and the existence of multi-dimensional alternatives with a cleaner data model in the form of xarray, but such a change would certainly break a non-negligible number of workflows. Making indexes optional would be a much less ambiguous win.

> That said, it's not clear to me how a Series with an Index fits into this world

I think it could work in roughly the same way it currently does. Pulling a column out of a DataFrame would return a Series object that carries with it any indexes on the frame.

wesm · 2016-09-07T18:19:45Z

I'll put some thought into this when I have a chance, but: one possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".

Referencing a column would give a pandas.Array.
Combining an Array with an Index produces a Series
Combining an Index with a Table produces a DataFrame

The Table would be more like an R data.frame / data frames.
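The layering described above can be sketched with plain dataclasses. None of these classes exist in pandas; the names follow the comment, while the fields, the length check, and the `loc` method are assumptions made only to keep the example runnable.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Array:
    """Hypothetical index-free column of values."""
    values: List[float]


@dataclass
class Index:
    """Hypothetical row labels."""
    labels: List[str]


@dataclass
class Series:
    """Array + Index."""
    index: Index
    array: Array

    def loc(self, label: str) -> float:
        return self.array.values[self.index.labels.index(label)]


@dataclass
class Table:
    """Hypothetical 'DataFrame without the Index'."""
    columns: Dict[str, Array]

    def __getitem__(self, name: str) -> Array:
        return self.columns[name]


@dataclass
class DataFrame:
    """Table + Index."""
    index: Index
    table: Table

    def __post_init__(self) -> None:
        for name, column in self.table.columns.items():
            if len(column.values) != len(self.index.labels):
                raise ValueError(f"column {name!r} does not match the index length")

    def __getitem__(self, name: str) -> Series:
        return Series(self.index, self.table[name])


table = Table({"price": Array([1.5, 2.5])})
frame = DataFrame(Index(["a", "b"]), table)
print(table["price"])           # an Array: no labels attached
print(frame["price"].loc("b"))  # a Series: label lookup works
```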

chrisaycock · 2016-09-07T19:07:57Z

Would the Table have the same functionality as a DataFrame? I.e., queries, joins, aggregations, IO, etc?

wesm · 2016-09-07T20:17:29Z

One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred. The deferred table DSL I designed for Ibis, which has effectively 1-1 parity with SQL, is one example of such a language and could provide some inspiration.
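As a rough illustration of what "deferred" means here, the toy class below records operations and only runs them when `execute()` is called. `DeferredTable` and its methods are hypothetical and are not the Ibis API.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Step = Callable[[List[Row]], List[Row]]


class DeferredTable:
    """Toy deferred table: builds up a plan, evaluates on demand."""

    def __init__(self, rows: List[Row], steps: Tuple[Step, ...] = ()) -> None:
        self._rows = rows
        self._steps = steps

    def _then(self, step: Step) -> "DeferredTable":
        # Each operation returns a new expression; nothing is computed yet.
        return DeferredTable(self._rows, self._steps + (step,))

    def filter(self, predicate: Callable[[Row], bool]) -> "DeferredTable":
        return self._then(lambda rows: [r for r in rows if predicate(r)])

    def add_column(self, name: str, func: Callable[[Row], float]) -> "DeferredTable":
        return self._then(lambda rows: [{**r, name: func(r)} for r in rows])

    def execute(self) -> List[Row]:
        # Run the recorded steps in order on a copy of the source rows.
        rows = [dict(r) for r in self._rows]
        for step in self._steps:
            rows = step(rows)
        return rows


expr = (
    DeferredTable([{"x": 1}, {"x": 2}, {"x": 3}])
    .filter(lambda r: r["x"] > 1)
    .add_column("y", lambda r: r["x"] * 10)
)
print(expr.execute())  # [{'x': 2, 'y': 20}, {'x': 3, 'y': 30}]
```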

shoyer · 2016-09-08T01:10:28Z

> One possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".

I'm a really big fan of the Table data structure. It could do enough for most users and client libraries, and DataFrame could be left for those who need indexing and alignment (which are important but niche use cases).

See also the datascience package for teaching introductory data science in Python.

On the other hand, the downside is that now we have two similar core data structures for tabular data. With the dynamic nature of Python, this could easily lead to confusion, and also twice the API to maintain.

> One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred

I'm all for deferred APIs, but I'm less sure that this makes sense for the DataFrame/Table distinction. It dilutes the message of Table as "DataFrame without the Index".

chrisaycock · 2016-09-08T02:03:54Z

I definitely like the idea of a base Table and then a DataFrame that adds an index. Other than indices, there doesn't need to be a distinction in terms of functionality.

wesm · 2016-09-08T03:21:20Z

@shoyer I agree that having a deferred API as a separate beast would be better, with the basic table being a pared-down, indexless DataFrame (with all operations eagerly evaluated). Was just curious what you all thought =)

shoyer · 2016-09-26T05:10:33Z

I am probably going to make indexes optional in the next version of xarray (pydata/xarray#1017). (Note that in xarray, an index already is basically just a special kind of column, but currently we always generate an index like range(n).) I guess we'll see how the transition goes, but I am tentatively very optimistic about it.

It occurs to me that an additional virtue of optional indexes is that it could allow us to further clean up DataFrame.__getitem__ with sane mixing between label and position based indexing, because we can differentiate between intentional integer indexes and no index at all. I'll elaborate over in #22.
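A small sketch of the ambiguity in current pandas, with made-up values: after filtering, the leftover labels of a default index are indistinguishable from integer labels that someone chose on purpose, so label and position lookups diverge.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])  # no index was ever requested
tail = s[s > 20]                 # still carries labels 2 and 3

print(tail.iloc[0])  # 30, first element by position
print(tail.loc[2])   # 30, same element by its leftover label

# Label 0 no longer exists, even though position 0 does.
try:
    tail.loc[0]
except KeyError:
    print("label 0 is gone")
```

With an explicit "no index" state, lookups on `tail` could reasonably be positional only, while a Series built with deliberate integer labels would keep label semantics.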

shoyer mentioned this issue Sep 24, 2016

WIP: Optional indexes (no more default coordinates given by range(n)) pydata/xarray#1017

Merged

5 tasks

shoyer mentioned this issue Sep 26, 2016

Simplifying indexing (DataFrame.__getitem__) #22

Open

chrisaycock mentioned this issue Sep 26, 2016

Unified merge API #31

Open

jreback added the indexing label Sep 30, 2016

jorisvandenbossche mentioned this issue Apr 6, 2017

Resetting Index on slice pandas-dev/pandas#15930

Closed

jbrockmendel mentioned this issue Jul 24, 2017

Index/Series Convergence pandas-dev/pandas#17061

Closed

jorisvandenbossche mentioned this issue Jun 5, 2020

index=False by default when exporting dataframes pandas-dev/pandas#34576

Closed

mroeschke mentioned this issue Oct 3, 2022

API Should Index be made opt-in? pandas-dev/pandas#48880

Closed
