Optional indexes #17
Comments
+1, although I think the opposite approach may also be worth consideration. What if instead of a being a special property of a
This would remove the That said, it's not clear to me how a
@chris-b1 Yes, in fact I almost included "indexes as just a special type of column" as part of this issue, but then decided to save it for another one. Since you brought it up (and it's related), we might as well discuss it here. I also really like this idea, because the index is a major stumbling block for both new and experienced users. It does entail a major overhaul of the pandas data model, though, which raises a number of questions. In particular: do we still use an This raises a big issue with how we handle "messy" data (i.e., non-tidy data). Currently, pandas is a pretty capable tool for such datasets, especially with the ability to use Consider this example adapted from the multi-index docs:
import numpy as np
import pandas as pd
# Two-level column labels: ('bar', 'one'), ('bar', 'two'), ...
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# 3 rows labelled by 'letter', 6 columns labelled by the multi-index
df = pd.DataFrame(np.random.randn(3, 6), index=pd.Index(['A', 'B', 'C'], name='letter'), columns=index)
print(df)
Do we now need to add in an extra level to the multi-index for the index column name (e.g., In general, I don't think it's worth a huge amount of effort to make it easy to work with such data, given how much nicer tidy data is and the existence of multi-dimensional alternatives with a cleaner data model in the form of xarray, but such a change would certainly break a non-negligible number of workflows. Making indexes optional would be a much less ambiguous win.
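For readers who want to see what "tidy" means for the frame above, here is a minimal sketch (not from the original comment) that reshapes it so every index and column level becomes an ordinary column. The names `tidy` and `value` are illustrative assumptions; `DataFrame.unstack`, `Series.rename` and `reset_index` are existing pandas methods.

```python
import numpy as np
import pandas as pd

# Rebuild the "messy" frame: one row index level, two column levels.
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
columns = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 6),
                  index=pd.Index(['A', 'B', 'C'], name='letter'),
                  columns=columns)

# unstack() on a frame with a single-level row index returns a Series
# whose index combines the column levels with the row index.
# reset_index() then turns every level into a plain column.
tidy = df.unstack().rename('value').reset_index()
print(tidy.head())
```

In the tidy form there is one observation per row and no label carries special status, which is roughly the data model the "indexes as columns" idea would favor.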
I think it could work in roughly the same way it currently does. Pulling a column out of a DataFrame would return a Series object, associating with it any indexes on the frame.
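As a reference point, this is how current pandas already behaves when a column is pulled out of an indexed frame. It is a small sketch with made-up data (`frame`, `key` and `x` are illustrative names), not a statement about how the redesigned model would be implemented.

```python
import pandas as pd

# A frame with an explicit index built from the 'key' column.
frame = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]}).set_index('key')

# Selecting a column returns a Series that carries the frame's index along.
col = frame['x']
print(col.index.equals(frame.index))

# Label-based lookup therefore keeps working on the extracted column.
print(col.loc['b'])
```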
I'll put some thought into this when I have a chance, but: one possibility to consider is exposing a more primitive
The Table would be more like an R data.frame / data frames.
Would the
One thought would be to equip
I'm a really big fan of the See also the datascience package for teaching introductory data science in Python. On the other hand, the downside is that now we have two similar core data structures for tabular data. With the dynamic nature of Python, this could easily lead to confusion, and also twice the API to maintain.
I'm all for deferred APIs, but I'm less sure that this makes sense for the
I definitely like the idea of a base
@shoyer I agree that having a deferred API as a separate beast would be better, and that the basic table should be a pared-down, indexless DataFrame (with all operations eagerly evaluated). Was just curious what you all thought =)
I am probably going to make indexes optional in the next version of xarray (pydata/xarray#1017). (Note that in xarray, an index already is basically just a special kind of column, but currently we always generate an index like It occurs to me that an additional virtue of optional indexes is that it could allow us to further clean up
The pandas.Index is fantastically useful, but in many cases pandas's insistence on always having an index gets in the way.
Usually, it can be safely ignored when not relevant, especially now that we have RangeIndex (which makes the cost of creating the index minimal), but this is not always the case:
RangeIndex is actively harmful. It would be better to raise an error when implicitly joining on an index between two datasets with a default index.
index=True) for controlling whether or not to include the index.
I propose that we make the index optional, e.g., by allowing it to be set to None. This entails a need for some rules to handle missing indexes:
.loc and join) should raise TypeError when called on objects without an index.
Somewhat related: #15
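To make the "implicitly joining on a default index" problem concrete, here is a small sketch with made-up data (it is not part of the original proposal). It shows how current pandas aligns two Series on their default integer labels rather than on position, which can silently reorder values or introduce NaN.

```python
import pandas as pd

left = pd.Series([1, 2, 3])                 # default labels 0, 1, 2
right = pd.Series([10, 20, 30]).iloc[::-1]  # reversed rows keep labels 2, 1, 0

# Arithmetic aligns on the labels, not on row order.
by_label = left + right

# Dropping to NumPy arrays adds strictly by position instead.
by_position = left.to_numpy() + right.to_numpy()

# After filtering, label 0 exists only on the right, so alignment yields NaN there.
filtered = left[left > 1] + right

print(by_label)
print(by_position)
print(filtered)
```

Neither index here was chosen by the user, yet both decide the result. Under the proposal, objects without an index would have nothing to align on, so an operation like this could raise instead of guessing.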