ENH: set accessor for Series (WIP) #21547

h-vetinari · 2018-06-19T21:33:23Z

Closes #4480, and a fair bit more that was discussed in comments. In particular, it makes the numpy methods NaN-safe, index aware, and adds a bit of wrapping convenience.

This is still a WIP, so no tests yet. Also more work to do on docstrings, whatsnew, and some overview documentation similar to pandas/doc/source/text.rst. The implementation borrows heavily from pandas/core/strings.py.

Before I write any tests etc., I'd like to have a discussion on the API - which functions to add (I only added union/intersection/difference/symmetric difference), how to name them, and what arguments they should have. I've followed the suggestions in #13877 (not yet realised for the .str-accessor) to have an errors- and a fill_value-parameter, as well as the join-parameter from #20347.

Here's a sample API documentation:

Here's some more usage examples. The basic idea is to follow the numpy set methods, which work with | & ^ - between arrays and arrays (as well as arrays and singletons), but are not NaN-safe. Basiscs:

s = pd.Series([{1, 2}, {2, 4}, {3, 1}])
s
# 0    {1, 2}
# 1    {2, 4}
# 2    {1, 3}
# dtype: object

s.set.union()  # apply (sequentially) to elements of calling Series
# {1, 2, 3, 4}

s.set.union({2})  # broadcast like "s.values | {2}"
# 0       {1, 2}
# 1       {2, 4}
# 2    {1, 2, 3}
# dtype: object

With another Series:

t = pd.Series([{2, 3}, {1, 2}, np.nan])
t
# 0    {2, 3}
# 1    {1, 2}
# 2       NaN
# dtype: object

s.values | t.values  # remember...
# TypeError

s.set.union(t)
# 0    {1, 2, 3}
# 1    {1, 2, 4}
# 2          NaN
# dtype: object

s.set.union(t, fill_value={5})
# 0    {1, 2, 3}
# 1    {1, 2, 4}
# 2    {1, 3, 5}
# dtype: object

With different indices (fill_value also works for NaNs created by alignment):

u = pd.Series(t.values, index=[1, 2, 3])
u
# 1    {2, 3}
# 2    {1, 2}
# 3       NaN
# dtype: object

s.set.union(u)
# 0          NaN
# 1    {2, 3, 4}
# 2    {1, 2, 3}
# dtype: object

s.set.union(u, join='outer')
# 0          NaN
# 1    {2, 3, 4}
# 2    {1, 2, 3}
# 3          NaN
# dtype: object

s.set.union(u, join='outer', fill_value={5})
# 0    {1, 2, 5}
# 1    {2, 3, 4}
# 2    {1, 2, 3}
# 3          {5}
# dtype: object

Finally, the behaviour of the errors-parameter. Since strings are not list-like, but can be coerced into a set, I made a distinction between 'coerce' and 'wrap', which is the most permissive (but treats strings as singletons).

v = pd.Series([{1, 2}, [2, 3], 'abcd', 4, np.nan])
v
# 0    {1, 2}
# 1    [2, 3]
# 2      abcd
# 3         4
# 4       NaN
# dtype: object

v.set.union(set(), errors='raise')  # default
# ValueError

v.set.union(set(), errors='ignore')
# 0    {1, 2}
# 1    {2, 3}
# 2       NaN
# 3       NaN
# 4       NaN
# dtype: object

v.set.union(set(), errors='coerce')
# 0          {1, 2}
# 1          {2, 3}
# 2    {a, c, b, d}
# 3             NaN
# 4             NaN
# dtype: object

v.set.union(set(), errors='wrap')
# 0    {1, 2}
# 1    {2, 3}
# 2    {abcd}
# 3       {4}
# 4       NaN
# dtype: object

jreback

tests - they should be the first thing you write

codecov · 2018-06-20T06:43:17Z

Codecov Report

Merging #21547 into master will increase coverage by 0.02%.
The diff coverage is 98.05%.

@@            Coverage Diff             @@
##           master   #21547      +/-   ##
==========================================
+ Coverage   91.91%   91.94%   +0.02%     
==========================================
  Files         153      154       +1     
  Lines       49546    49716     +170     
==========================================
+ Hits        45542    45709     +167     
- Misses       4004     4007       +3

Flag	Coverage Δ
#multiple	`90.34% <98.05%> (+0.02%)`	⬆️
#single	`41.78% <33.76%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/series.py	`94.2% <100%> (+0.02%)`	⬆️
pandas/core/sets.py	`98.01% <98.01%> (ø)`
pandas/core/arrays/categorical.py	`95.63% <0%> (-0.06%)`	⬇️
pandas/core/generic.py	`96.12% <0%> (-0.01%)`	⬇️
pandas/core/indexes/base.py	`96.62% <0%> (ø)`	⬆️
pandas/core/indexes/category.py	`97.28% <0%> (+0.18%)`	⬆️
pandas/core/dtypes/cast.py	`88.49% <0%> (+0.26%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5fbb683...8c98d30. Read the comment docs.

h-vetinari · 2018-06-20T06:47:58Z

@jreback IMO there's something that needs to come before tests, and that's having a good idea about the API (which I can't decide by myself; hence the discussion here). Otherwise, a lot of time is sunk in writing obsolete tests.

Could you please comment on your thoughts for the API choices so far? The rendered docstring should give a good starting point.

h-vetinari · 2018-06-20T17:53:22Z

@jreback I added over 200 tests (including parametrization almost everywhere) that cover everything I could think of. It's hopefully modular enough that renaming/changing the API won't be too much of a hassle.

I also added the same 'errors'/'fill_value' parameters to set.len, which I could therefore refactor (and cut out a lot of code I had pasted from pandas/core/strings.py).

jreback · 2018-06-21T09:57:33Z

at a glance it looks ok, but you have way too much complexity here. all of the errors options and the fill_value. some simpler errors are ok, but still to the coerce, ignore, raise options.

Furthermore this is fundamentally flawed in that you can use this with anything, in reality this should be be restricted to an inferred_dtype=='set', so that's the first thing to change.

jorisvandenbossche · 2018-06-21T10:05:12Z

To be the devil's advocate here: given the new accessor registry api: isn't this a perfect case for having this in an external package?

h-vetinari · 2018-06-21T10:21:06Z

@jreback I don't care much if it's only 'raise'|'ignore'|'coerce', just added 'wrap' for convenience and 'skip' for performance. The 'errors'/'fill_value' pair is used in many places, and propsed for .str-methods in #13877, so it made sense for me to emulate that. fill_value has negligible complexity attached, BTW.

To me, the whole point of giving an 'errors'-parameter is to be able to use it with anything, and whether it's lists or strings or whatnot, this should be made available IMO. Especially since it's such a bitch to prepare real-world data to a point where it's just sets (not least because fillna doesn't take lists or sets).

Aside from the templatization to avoid repetition, I went out of my way to keep complexity at a minimum. Could you tell me what you find complex? There's one clean-up method for Series, one clean-up for others, one method for executing the numpy method (and one putting the pieces together), all nicely segregated...

h-vetinari · 2018-06-21T10:21:15Z

@jorisvandenbossche I maybe don't fully understand the implications/requirements of accessor registry, but since this is really just a wrapper to make the existing numpy methods nan-safe and index-aware, I think this should be part of the core, and not an external package?

jreback · 2018-06-21T10:26:39Z

@h-vetinari
pls read my last point, which is the main point. sets are not a first class type, that is the first problem to address.

jreback · 2018-06-21T10:28:26Z

@h-vetinari in reality this is much better done as a real ExtensionArray type. The point of the accessors is to extend with dtype specific functionaility. But first you have to have actual set operations supported.

Furthermore these are quite non-performant.

h-vetinari · 2018-06-21T10:29:39Z

@jreback To which I responded as follows

To me, the whole point of giving an 'errors'-parameter is to be able to use it with anything, [...]

I agree this is an important discussion - but there's normally lots of preprocessing necessary to get this to work with dirty real-world data, and every restriction makes it harder to actually transform real data. Something about consenting adults and all that. ;-)

jreback · 2018-06-21T10:33:58Z

agree this is an important discussion - but there's normally lots of preprocessing necessary to get this to work with dirty real-world data, and every restriction makes it harder to actually transform real data. Something about consenting adults and all that. ;-)

sure but that's what pandas does. set intelligent defaults.

The issue I have is that this is the very first thing that works on containers inside a pandas object. Everything else is a scalar. Sure its 'ok' now, you 'can' use whatever you want but there is no first class support. So need to think about best way / if to provide this support.

h-vetinari · 2018-06-21T10:34:50Z

@jreback
Re:

@h-vetinari in reality this is much better done as a real ExtensionArray type. The point of the accessors is to extend with dtype specific functionaility. But first you have to have actual set operations supported.

But then there are no methods to transform the data to a point where the accessor can be used. At least .str has .astype(str). Of course, then the 'errors'/'fill_value' kwargs could be positioned in a potential .astype(set, errors, fill_value) call, which does have a certain appeal, I admit.

Furthermore these are quite non-performant.

This is not a pressing issue IMO; the numpy methods are orders of magnitude faster than writing for-loops already.

h-vetinari · 2018-06-21T10:43:20Z

@jreback

The issue I have is that this is the very first thing that works on containers inside a pandas object. Everything else is a scalar. Sure its 'ok' now, you 'can' use whatever you want but there is no first class support. So need to think about best way / if to provide this support.

Lots of things work on containers already, like .str.len(), .str[slice], for lists etc (yes, this is deliberately misusing the .str-accessor, but it works). Also, .str.extract and str.split with expand=False yield lists in Series, for example.

And people are using pandas in this way a lot - for clean data, one can just call map(set), map(list), 'map(len)` etc. and continue, but often the clean-up is the biggest headache. Granted, sometimes it's just an intermediate processing step on the way to a DF or a MultiIndex, but sometimes its genuinely a set/list that needs per-row processing. There's several SO questions on this, for example.

h-vetinari · 2018-06-22T06:04:44Z

@jreback How do you propose to continue with this? As I said above, I like the idea of:

a set dtype
having all the conversion logic in .astype(set, errors, fill_value)
having the .set accessor only for that dtype, and then the methods don't need the kwargs themselves

But the problem is, I have no idea what creating a new dtype would imply. It would be easy enough to write the code for .astype(set, errors, fill_value), but I'm guessing that's not nearly enough.

h-vetinari · 2018-06-22T06:26:29Z

@jreback looking at #21160 and related PRs for example, that looks to be a much too tall of an order for me...

Similarly for #8640 and the work you refer to in your last comment:

nice library by xhochy https://github.com/xhochy/fletcher

jreback · 2018-06-22T09:57:37Z

http://pandas-docs.github.io/pandas-docs-travis/extending.html#extension-types

this should be an extension array. the actual impl is pretty easy, it could just e a list of sets. The problem is adding an accessor to disambiguate an object dtype is a bit odd. I am not totally against it, but having a defined dtype is much much cleaner.

h-vetinari · 2018-06-22T10:20:07Z

@jreback Not sure I follow completely - I'm afraid I'm getting a bit confused between implementing an EA (your link; apparently not so hard) and implementing a new dtype set (very hard?).

Further up you said:

[...] this should be be restricted to an inferred_dtype=='set', so that's the first thing to change.

which sounds like implementing a new dtype, and now:

[...] adding an accessor to disambiguate an object dtype is a bit odd.

which sounds (more) like the version I already implemented (or the same thing translated to EA)? Again, it's not clear to me what the requirements/recommendations are at this point.

As a side note, I think that sets are such a fundamental building block in python (as opposed to IP addresses in cyperpandas or gps coordinates in geopandas), that the methods for them shouldn't be externalised to some package - this is what the EA documentation says, if I understand correctly.

jreback · 2018-06-22T10:37:00Z

@h-vetinari whether this is external or not is slightly orthogonal to this discussion, the point of EA is that it can be external (or maybe start that way).

EA is an extension array & a dtype, both of which are first class. That's how I'd like to see sets (and lists and dicts)

jreback · 2018-09-25T16:29:51Z

closing as out of scope. This should be an EA type (could potentially be internal), but needs to be restructured to that.

Florents-Tselai · 2018-12-26T16:58:22Z

I also needed something like this, but as @jorisvandenbossche said, initially makes sense to have it as an extension. pip install pandas-sets

You can have a look here https://github.com/Florents-Tselai/pandas-sets - related post: https://tselai.com/pandas-sets.html.

The implementation is very primitive (and based on StringMethods) , but it does the job.

May come back later with a PR.

ENH: set accessor (WIP)

a3563da

h-vetinari force-pushed the set_accessor branch from 64b4bd0 to a3563da Compare June 19, 2018 21:53

jreback requested changes Jun 19, 2018

View reviewed changes

Added lots of tests; many fixes; refactored len

8c98d30

gfyoung added Enhancement Dtype Conversions Unexpected or buggy dtype conversions labels Jun 21, 2018

h-vetinari mentioned this pull request Aug 16, 2018

WIP: EA SetArray #22382

Closed

jreback closed this Sep 25, 2018

h-vetinari mentioned this pull request Nov 20, 2018

ExtensionDtype should be hashable #22476

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: set accessor for Series (WIP) #21547

ENH: set accessor for Series (WIP) #21547

h-vetinari commented Jun 19, 2018 •

edited

Loading

jreback left a comment

codecov bot commented Jun 20, 2018 •

edited

Loading

h-vetinari commented Jun 20, 2018

h-vetinari commented Jun 20, 2018 •

edited

Loading

jreback commented Jun 21, 2018

jorisvandenbossche commented Jun 21, 2018

h-vetinari commented Jun 21, 2018 •

edited

Loading

h-vetinari commented Jun 21, 2018

jreback commented Jun 21, 2018

jreback commented Jun 21, 2018 •

edited

Loading

h-vetinari commented Jun 21, 2018

jreback commented Jun 21, 2018

h-vetinari commented Jun 21, 2018 •

edited

Loading

h-vetinari commented Jun 21, 2018 •

edited

Loading

h-vetinari commented Jun 22, 2018

h-vetinari commented Jun 22, 2018 •

edited

Loading

jreback commented Jun 22, 2018

h-vetinari commented Jun 22, 2018

jreback commented Jun 22, 2018

jreback commented Sep 25, 2018

Florents-Tselai commented Dec 26, 2018

ENH: set accessor for Series (WIP) #21547

ENH: set accessor for Series (WIP) #21547

Conversation

h-vetinari commented Jun 19, 2018 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 20, 2018 • edited Loading

Codecov Report

h-vetinari commented Jun 20, 2018

h-vetinari commented Jun 20, 2018 • edited Loading

jreback commented Jun 21, 2018

jorisvandenbossche commented Jun 21, 2018

h-vetinari commented Jun 21, 2018 • edited Loading

h-vetinari commented Jun 21, 2018

jreback commented Jun 21, 2018

jreback commented Jun 21, 2018 • edited Loading

h-vetinari commented Jun 21, 2018

jreback commented Jun 21, 2018

h-vetinari commented Jun 21, 2018 • edited Loading

h-vetinari commented Jun 21, 2018 • edited Loading

h-vetinari commented Jun 22, 2018

h-vetinari commented Jun 22, 2018 • edited Loading

jreback commented Jun 22, 2018

h-vetinari commented Jun 22, 2018

jreback commented Jun 22, 2018

jreback commented Sep 25, 2018

Florents-Tselai commented Dec 26, 2018

h-vetinari commented Jun 19, 2018 •

edited

Loading

codecov bot commented Jun 20, 2018 •

edited

Loading

h-vetinari commented Jun 20, 2018 •

edited

Loading

h-vetinari commented Jun 21, 2018 •

edited

Loading

jreback commented Jun 21, 2018 •

edited

Loading

h-vetinari commented Jun 21, 2018 •

edited

Loading

h-vetinari commented Jun 21, 2018 •

edited

Loading

h-vetinari commented Jun 22, 2018 •

edited

Loading