-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: set accessor for Series (WIP) #21547
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests - they should be the first thing you write
Codecov Report
@@ Coverage Diff @@
## master #21547 +/- ##
==========================================
+ Coverage 91.91% 91.94% +0.02%
==========================================
Files 153 154 +1
Lines 49546 49716 +170
==========================================
+ Hits 45542 45709 +167
- Misses 4004 4007 +3
Continue to review full report at Codecov.
|
@jreback IMO there's something that needs to come before tests, and that's having a good idea about the API (which I can't decide by myself; hence the discussion here). Otherwise, a lot of time is sunk in writing obsolete tests. Could you please comment on your thoughts for the API choices so far? The rendered docstring should give a good starting point. |
@jreback I added over 200 tests (including parametrization almost everywhere) that cover everything I could think of. It's hopefully modular enough that renaming/changing the API won't be too much of a hassle. I also added the same |
at a glance it looks ok, but you have way too much complexity here. all of the Furthermore this is fundamentally flawed in that you can use this with anything, in reality this should be be restricted to an |
To be the devil's advocate here: given the new accessor registry api: isn't this a perfect case for having this in an external package? |
@jreback I don't care much if it's only To me, the whole point of giving an Aside from the templatization to avoid repetition, I went out of my way to keep complexity at a minimum. Could you tell me what you find complex? There's one clean-up method for Series, one clean-up for |
@jorisvandenbossche I maybe don't fully understand the implications/requirements of accessor registry, but since this is really just a wrapper to make the existing numpy methods nan-safe and index-aware, I think this should be part of the core, and not an external package? |
@h-vetinari |
@h-vetinari in reality this is much better done as a real ExtensionArray type. The point of the accessors is to extend with dtype specific functionaility. But first you have to have actual set operations supported. Furthermore these are quite non-performant. |
@jreback To which I responded as follows
I agree this is an important discussion - but there's normally lots of preprocessing necessary to get this to work with dirty real-world data, and every restriction makes it harder to actually transform real data. Something about consenting adults and all that. ;-) |
sure but that's what pandas does. set intelligent defaults. The issue I have is that this is the very first thing that works on containers inside a pandas object. Everything else is a scalar. Sure its 'ok' now, you 'can' use whatever you want but there is no first class support. So need to think about best way / if to provide this support. |
@jreback
But then there are no methods to transform the data to a point where the accessor can be used. At least
This is not a pressing issue IMO; the numpy methods are orders of magnitude faster than writing for-loops already. |
Lots of things work on containers already, like And people are using pandas in this way a lot - for clean data, one can just call |
@jreback How do you propose to continue with this? As I said above, I like the idea of:
But the problem is, I have no idea what creating a new dtype would imply. It would be easy enough to write the code for |
@jreback looking at #21160 and related PRs for example, that looks to be a much too tall of an order for me... Similarly for #8640 and the work you refer to in your last comment:
|
http://pandas-docs.github.io/pandas-docs-travis/extending.html#extension-types this should be an extension array. the actual impl is pretty easy, it could just e a list of sets. The problem is adding an accessor to disambiguate an object dtype is a bit odd. I am not totally against it, but having a defined dtype is much much cleaner. |
@jreback Not sure I follow completely - I'm afraid I'm getting a bit confused between implementing an EA (your link; apparently not so hard) and implementing a new dtype Further up you said:
which sounds like implementing a new dtype, and now:
which sounds (more) like the version I already implemented (or the same thing translated to EA)? Again, it's not clear to me what the requirements/recommendations are at this point. As a side note, I think that sets are such a fundamental building block in python (as opposed to IP addresses in cyperpandas or gps coordinates in geopandas), that the methods for them shouldn't be externalised to some package - this is what the EA documentation says, if I understand correctly. |
@h-vetinari whether this is external or not is slightly orthogonal to this discussion, the point of EA is that it can be external (or maybe start that way). EA is an extension array & a dtype, both of which are first class. That's how I'd like to see sets (and lists and dicts) |
closing as out of scope. This should be an EA type (could potentially be internal), but needs to be restructured to that. |
I also needed something like this, but as @jorisvandenbossche said, initially makes sense to have it as an extension. You can have a look here https://github.com/Florents-Tselai/pandas-sets - related post: https://tselai.com/pandas-sets.html. The implementation is very primitive (and based on May come back later with a PR. |
Closes #4480, and a fair bit more that was discussed in comments. In particular, it makes the
numpy
methods NaN-safe, index aware, and adds a bit of wrapping convenience.This is still a WIP, so no tests yet. Also more work to do on docstrings, whatsnew, and some overview documentation similar to
pandas/doc/source/text.rst
. The implementation borrows heavily frompandas/core/strings.py
.Before I write any tests etc., I'd like to have a discussion on the API - which functions to add (I only added union/intersection/difference/symmetric difference), how to name them, and what arguments they should have. I've followed the suggestions in #13877 (not yet realised for the
.str
-accessor) to have anerrors
- and afill_value
-parameter, as well as thejoin
-parameter from #20347.Here's a sample API documentation:
Here's some more usage examples. The basic idea is to follow the numpy set methods, which work with
| & ^ -
between arrays and arrays (as well as arrays and singletons), but are not NaN-safe. Basiscs:With another Series:
With different indices (
fill_value
also works for NaNs created by alignment):Finally, the behaviour of the
errors
-parameter. Since strings are not list-like, but can be coerced into a set, I made a distinction between'coerce'
and'wrap'
, which is the most permissive (but treats strings as singletons).