Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: allow storing ExtensionArrays in the Index #34159

Closed

Conversation

jorisvandenbossche
Copy link
Member

I started experimenting with allowing EAs to be used in the Index (so we can put nullable int/bool, string dtype etc in the Index).

My approach is to use the base Index class for this (and not a custom subclass):

  • We use a similar approach for Series, which can also hold either ndarray or EA as underlying values
  • I don't think we want to start adding subclasses for all new EAs (like Int64Index for the new Int64Dtype is not quite possible ..)
  • This also works for external EAs

For now, it just integrates it, so an EA can be put in an index and tested operations work. This PR does not (yet) handle ensuring that operations relying on the index' Engine are efficient (currently it simply uses np.asarray(EA) as the engine values, which eg will be object dtype for Int64Dtype)

cc @jbrockmendel @TomAugspurger @jreback

@jbrockmendel
Copy link
Member

This is the idea behind ExtensionIndex

@jreback
Copy link
Contributor

jreback commented May 13, 2020

I agree with @jbrockmendel here, this is the exactly the point of ExtensionIndex. trying to modify Index doesn't make much sense.

@jorisvandenbossche
Copy link
Member Author

Sorry, I should have been more explicit regarding ExtensionIndex in my reasoning above:

  • Yes, this is (somewhat/fully) duplicating ExtensionIndex. But I think some things in ExtensionIndex can just be moved to the base class index. Again, for Series we have exactly the same approach of having both ndarray or EA as underlying values (we don't have an "ExtensionSeries").
  • Currently, the ExtensionIndex is only used as base class that concrete Index classes subclass from. Here, it is meant to be the actual class that users use/see. I personally don't think this should be "ExtensionIndex"

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented May 13, 2020

This is the idea behind ExtensionIndex

@jbrockmendel to make it concrete: are there (technical) reasons for you that are specific to Index and that make it more important to use a custom ExtensionIndex subclass instead of handling this in the base Index class?

For example, both have a __getitem__, which in both cases dispatch to self._data[..], but with some different checks (eg the base Index also checks for object boolean masks). But is there a fundamental reason those can't be merged?

Other example: the dropna in both is exactly the same, except for the use of self._data vs self._values, but self._values is defined to be self._data.
For repeat the same.
For take as well, except that the base class method just has a branch for _can_hold_na being False as well (which is never the case for the ExtensionIndex)

(a pre-cursor PR to this one could be to remove the duplicated methods between Index and ExtensionIndex)

@jbrockmendel
Copy link
Member

to make it concrete: are there (technical) reasons for you that are specific to Index and that make it more important to use a custom ExtensionIndex subclass instead of handling this in the base Index class?

Nothing that I have a particularly strong opinion about. If the implementation in this PR turns out to be elegant (unless asked, ill wait to review in detail until green), i wouldn't have a problem with it.

That said, ATM whenever we have an Index object we know it is object-dtype. I expect there are places in the code where we are relying on that. (might be worth making an ObjectIndex? orthogonal discussion)

@jreback
Copy link
Contributor

jreback commented May 13, 2020

@jorisvandenbossche yeah this is almost certainly not going to 100% overlap with an object dtyped Index base, because that serves 2 master right now, the object dtyped and a base index. If we would create a ObjectIndex then maybe what you are doing would work, but I find it easier / more compatible to simply have an ExtensionArrayIndex that can hold any type of EA.

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Index Related to the Index class or subclasses labels Jun 14, 2020
arr = self._values
idx = arr._concat_same_type([arr[:loc], item, arr[loc:]])
else:
arr = np.asarray(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arr = self._values could go outside the if/elif

if isinstance(self._values, np.ndarray):
return self._values
else:
return np.asarray(self._values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesnt np.asarray(self._values) work unconditionally?

@@ -346,11 +346,13 @@ def __new__(
ea_cls = dtype.construct_array_type()
data = ea_cls._from_sequence(data, dtype=dtype, copy=False)
else:
data = np.asarray(data, dtype=object)
# TODO clean-up with extract_array ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 12, 2020
@jbrockmendel
Copy link
Member

@jorisvandenbossche im working on getting this feature out the door; i think we can close this to clear the queue

@jreback
Copy link
Contributor

jreback commented Nov 15, 2020

closing as duplicate of #37869 can move discussion there.

@jbrockmendel
Copy link
Member

@jorisvandenbossche I suggested closing this not based on the Index[EA] vs ExtensionIndex difference, but bc you don't seem to be working on it. If we get the other one working well, we can revisit the possibility of getting Index[EA] working.

(FWIW as I see more and more of the DTI/TDI/PI code getting shared, I've found myself thinking about the possibility those may not need to be subclasses at all [in this daydream, the subclass-specific stuff would be accessed by a .dt accessor xref #17134], which bears a resemblance to the Index[EA] design)

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

@jorisvandenbossche pls update or close

@jorisvandenbossche
Copy link
Member Author

We first need to agree on how to do this (see discussion above, and also the overview issue I created at #39133)

@mroeschke
Copy link
Member

Appears this PR draft has been dormant for a while so closing. Feel free to reopen when you have time to continue

@mroeschke mroeschke closed this Oct 31, 2021
@jorisvandenbossche
Copy link
Member Author

Note for future reference: this work is now actually done in #43930

@jorisvandenbossche jorisvandenbossche deleted the EA-index branch January 21, 2022 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Index Related to the Index class or subclasses Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants