Allows for merging of SparseDataFrames, and fixes __array__ interface #19488
```diff
@@ -238,6 +238,7 @@ Other Enhancements
 - ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
 - :class:`IntervalIndex` and its associated constructor methods (``from_arrays``, ``from_breaks``, ``from_tuples``) have gained a ``dtype`` parameter (:issue:`19262`)
+- :func:`pandas.merge` now supports merging of :class:`SparseDataFrame` (:issue:`13665`)

 .. _whatsnew_0230.api_breaking:
```

```diff
@@ -555,7 +556,7 @@ Sparse
 - Bug in which creating a ``SparseDataFrame`` from a dense ``Series`` or an unsupported type raised an uncontrolled exception (:issue:`19374`)
 - Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
--
+- Bug in :class:`SparseSeries.__array__` returning only non-faills (:issue:`13665`)

 Reshaping
 ^^^^^^^^^
```

Review comment: Typo in "faills"?

```diff
@@ -591,3 +592,4 @@ Other
 ^^^^^

 - Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
+- Improved algorithms.take_1d handling of ``SparseArray`` (:issue:`19506`)
```
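To make the headline change concrete, here is a hedged usage sketch of what the whatsnew entry describes (the frames are made up, and this uses the 0.23-era sparse API where ``DataFrame.to_sparse`` still exists):

```python
import pandas as pd

# Per the whatsnew entry above, pd.merge can now operate on SparseDataFrames
# directly instead of requiring a round-trip through dense frames.
left = pd.DataFrame({'A': range(0, 20, 2), 'B': range(10)}).to_sparse()
right = pd.DataFrame({'A': range(0, 30, 3), 'C': range(10)}).to_sparse()

result = pd.merge(left, right, how='left', on='A')
print(type(result))    # expected: a sparse frame, not a plain DataFrame
print(result.head())
```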
```diff
@@ -1315,6 +1315,11 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
        undefined if allow_fill == False and -1 is present in indexer.
     """

+    if is_sparse(arr):
+        return take_nd(arr.get_values(), indexer, axis=axis, out=out,
+                       fill_value=fill_value, mask_info=mask_info,
+                       allow_fill=allow_fill)
+
     # dispatch to internal type takes
     if is_categorical(arr):
         return arr.take_nd(indexer, fill_value=fill_value,
```

Review comment: So something that is a bit of a hotspot IMHO is that inside of algos it iterates through the memview and could cause some weird security / segfault issues. This fills that hole but really should fix that at a lower level.

Reply: Hmm, this seems sub-optimal. We shouldn't have to densify just to slice, right? Can you take a look at […]? And since you're touching performance-related code, could you run the sparse-related ASVs?

Reply: thank you for pointing this out. exactly what I was looking for. Will run the ASVs.
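For context, a minimal sketch of what the densify-and-dispatch branch above does, using the 0.23-era sparse API (``SparseArray.get_values()`` returns the dense ndarray); the values here are made up:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_sparse

# The sparse array is densified first, so the ordinary ndarray take path
# (including -1 / fill_value handling) applies unchanged.
arr = pd.SparseArray([1.0, np.nan, 3.0])
assert is_sparse(arr)

dense = arr.get_values()        # array([ 1., nan,  3.])
print(np.take(dense, [0, 2]))   # [1. 3.]
```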
```diff
@@ -3089,6 +3089,7 @@ def make_block(values, placement, klass=None, ndim=None, dtype=None,
         # GH#19265 pyarrow is passing this
         warnings.warn("fastpath argument is deprecated, will be removed "
                       "in a future release.", DeprecationWarning)
+
     if klass is None:
         dtype = dtype or values.dtype
         klass = get_block_type(values, dtype)
```

```diff
@@ -5304,6 +5305,22 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
         elif is_uniform_join_units(join_units):
             b = join_units[0].block.concat_same_type(
                 [ju.block for ju in join_units], placement=placement)
+        elif is_sparse_join_units(join_units):
+            values = concatenate_join_units(join_units, concat_axis, copy=copy)
+
+            if len(values.shape) == 2:
+                values = values[0]
+            else:
+                assert len(values.shape) == 1
+
+            block = join_units[0].block
+
+            if block:
+                fill_value = block.fill_value
+            else:
+                fill_value = np.nan
+
+            array = SparseArray(values, fill_value=fill_value)
+            b = make_block(array, klass=SparseBlock, placement=placement)
         else:
             b = make_block(
                 concatenate_join_units(join_units, concat_axis, copy=copy),
```

Review comment (on the `elif is_sparse_join_units(join_units):` line): Do you ever go down the initial […]?

Review comment (on the `values = concatenate_join_units(...)` line): this is a mess

Reply: yes, yes it is. I'll work on cleaning this up tomorrow since there's too many branches in this code.
```diff
@@ -5313,6 +5330,18 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
     return BlockManager(blocks, axes)


+def is_sparse_join_units(join_units):
+    """
+    Check if all of the join units are sparse. This leads to building
+    SparseArray over dense array representations so that we can merge
+    SparseSeries / SparseDataFrame.
+
+    This is very similar to how pandas.concat works for concatenating two
+    SparseDataFrame / SparseSeries.
+    """
+    return all(type(ju.block) is SparseBlock for ju in join_units)
+
+
 def is_uniform_join_units(join_units):
     """
     Check if the join units consist of blocks of uniform type that can
```

Review comment (on the function definition): Docstring would be nice, at least noting that this is true if any blocks are sparse.

Review comment (on the return line): This is very similar to (and maybe could be DRYed up with) pd.concat([sparse, sparse]).
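As a point of comparison for the docstring above, a hedged sketch of the pd.concat behavior the reviewers mention (0.23-era API, made-up frames): concatenating two sparse frames already keeps the result sparse, which is the path is_sparse_join_units lets the merge machinery mirror.

```python
import pandas as pd

# pd.concat already keeps sparse inputs sparse; is_sparse_join_units lets
# concatenate_block_managers take the analogous branch when merging.
a = pd.DataFrame({'x': [0, 0, 1]}).to_sparse(fill_value=0)
b = pd.DataFrame({'x': [0, 2, 0]}).to_sparse(fill_value=0)

out = pd.concat([a, b])
print(type(out))    # expected: a sparse frame with fill_value 0
```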
```diff
@@ -5686,7 +5715,10 @@ def is_na(self):
     def get_reindexed_values(self, empty_dtype, upcasted_na):
         if upcasted_na is None:
             # No upcasting is necessary
-            fill_value = self.block.fill_value
+
+            # You would think that you want self.block.fill_value here,
+            # but in reality that will fill with a bunch of wrong values.
+            fill_value = np.nan
             values = self.block.get_values()
         else:
             fill_value = upcasted_na
```

Review comment: this was surprising to me. If you don't pass in np.nan, what ends up happening is that if you merge two sparse frames it will concat with a bunch of […]

Reply: huh?

Reply: Yea, this confused me a bunch.

```python
evens = pd.DataFrame({'A': range(0, 20, 2), 'B': range(10)})
threes = pd.DataFrame({'A': range(0, 30, 3), 'B': range(10)})
threes.merge(evens, how="left", on="A")
```

yields […] as you'd expect for a dense dataframe, but if you sparsify it with a fill_value other than np.nan / None you get weird results:

```python
evens = pd.DataFrame({'A': range(0, 20, 2), 'B': range(10)}).to_sparse(fill_value=8675309)
threes = pd.DataFrame({'A': range(0, 30, 3), 'B': range(10)}).to_sparse(fill_value=90210)
threes.merge(evens, how="left", on="A")
```

yields […]. Is that expected behavior?
```diff
@@ -38,6 +38,8 @@
                                concatenate_block_managers)
 from pandas.util._decorators import Appender, Substitution

+from pandas.core.sparse.array import SparseArray
+
 from pandas.core.sorting import is_int64_overflow_possible
 import pandas.core.algorithms as algos
 import pandas.core.sorting as sorting
```

```diff
@@ -665,7 +667,6 @@ def _maybe_restore_index_levels(self, result):
         result.set_index(names_to_restore, inplace=True)

     def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
-
         left_has_missing = None
         right_has_missing = None
```

```diff
@@ -731,7 +732,11 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
                 if mask.all():
                     key_col = rvals
                 else:
-                    key_col = Index(lvals).where(~mask, rvals)
+                    # Might need to be IntIndex not Index
+                    if isinstance(lvals, SparseArray):
+                        key_col = Index(lvals.get_values()).where(~mask, rvals)
+                    else:
+                        key_col = Index(lvals).where(~mask, rvals)

                 if result._is_label_reference(name):
                     result[name] = key_col
```

Review comment (on the `# Might need to be IntIndex not Index` comment): don't do this, use […]

Review comment: I'm not sure if this has memory or performance issues, but this is the best solution I could come to with this. The other solution would be to look at using lvals.sp_index and implementing a where on it that works. One thing I have noticed is that IntIndex doesn't act quite like Index, which makes doing these masks tricky in sparse land.

Reply: It'd be nice to avoid […]
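To illustrate the masking logic above with made-up values (these arrays stand in for the real merge internals): Index.where keeps the left value where the condition is True and takes the right value elsewhere; the sparse branch simply densifies lvals via get_values() first.

```python
import numpy as np
import pandas as pd

# Index.where(cond, other): keep self where cond is True, otherwise
# substitute the corresponding element of other.
lvals = np.array([10, 20, 30, 40])
rvals = np.array([10, 99, 30, 99])
mask = np.array([False, True, False, True])   # positions filled from the right side

key_col = pd.Index(lvals).where(~mask, rvals)
print(list(key_col))    # [10, 99, 30, 99]
```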
```diff
@@ -38,7 +38,6 @@
 from pandas.util._decorators import Appender
 from pandas.core.indexes.base import _index_shared_docs

-
 _sparray_doc_kwargs = dict(klass='SparseArray')
```

```diff
@@ -259,6 +258,7 @@ def __array_wrap__(self, out_arr, context=None):
             ufunc, args, domain = context
             # to apply ufunc only to fill_value (to avoid recursive call)
             args = [getattr(a, 'fill_value', a) for a in args]
+
             with np.errstate(all='ignore'):
                 fill_value = ufunc(self.fill_value, *args[1:])
         else:
```

```diff
@@ -292,9 +292,9 @@ def __setstate__(self, state):
         self._fill_value = fill_value

     def __len__(self):
-        try:
+        if hasattr(self, 'sp_index'):
             return self.sp_index.length
-        except:
+        else:
             return 0

     def __unicode__(self):
```

Review comment: This felt really icky to me, to try and catch something... Feel free to push back on that change.

Reply: How was this failing before?

Reply: I have no clue! 😄 I couldn't figure out why the try / except was there in the first place, tbh.

Reply: huh?

Reply: This isn't essential to this PR. I just don't understand why this code was here in the first place.

Reply: Meaning I can take it out to bring down the code changes.
```diff
@@ -73,6 +73,10 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None,
             if columns is None:
                 raise Exception("cannot pass a series w/o a name or columns")
             data = {columns[0]: data}
+        elif isinstance(data, BlockManager):
+            fill_value_size = len(set(b.fill_value for b in data.blocks))
+            if default_fill_value is None and fill_value_size == 1:
+                default_fill_value = data.blocks[0].fill_value

         if default_fill_value is None:
             default_fill_value = np.nan
```

Review comment: Ok, I need feedback on this change. Basically what this does is: if every SparseSeries's fill_value is the same, it sets the default_fill_value to that. This to me seems really intuitive, but I trust your judgement. It also makes testing a whole hell of a lot easier, which is why I did it. I would argue pretty heavily that as a user, if I added sparse series that all had fill_value=0, then I would expect the sparse data frame to have default_fill_value = 0 as well.

Reply: yes, this is the only reason to have a standalone […]
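A hedged sketch of the intent described in the comment above (0.23-era sparse API; the series are made up, and the inference happens on the BlockManager branch shown in the diff):

```python
import pandas as pd

# If all sparse columns agree on a fill_value, the frame's
# default_fill_value is inferred from them rather than defaulting to NaN.
s1 = pd.Series([0, 1, 0, 2]).to_sparse(fill_value=0)
s2 = pd.Series([0, 0, 3, 0]).to_sparse(fill_value=0)

sdf = pd.SparseDataFrame({'a': s1, 'b': s2})
print(sdf.default_fill_value)   # expected: 0
```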
```diff
@@ -175,7 +175,7 @@ def values(self):

     def __array__(self, result=None):
         """ the array interface, return my values """
-        return self.block.values
+        return self.block.values.values

     def get_values(self):
         """ same as values """
```

Review comment: This will return the dense version of SparseSeries when calling […]

Reply: we need some more asv's for sparse I think.
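A quick sketch of the fixed interface (0.23-era API; the values are made up): np.asarray on a SparseSeries now yields the full dense values, fills included, rather than only the non-fill sp_values, which is the bug the whatsnew entry records.

```python
import numpy as np
import pandas as pd

# With the fix, __array__ densifies, so the result has one entry per row
# of the series, fill values included.
s = pd.Series([1.0, np.nan, 3.0]).to_sparse()
print(np.asarray(s))   # [ 1. nan  3.] -- length 3, not just the 2 sp_values
```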
```diff
@@ -271,6 +271,7 @@ def __array_wrap__(self, result, context=None):

         See SparseArray.__array_wrap__ for detail.
         """
+
         if isinstance(context, tuple) and len(context) == 3:
             ufunc, args, domain = context
             args = [getattr(a, 'fill_value', a) for a in args]
```

```diff
@@ -279,8 +280,18 @@ def __array_wrap__(self, result, context=None):
         else:
             fill_value = self.fill_value

+        # GH 14167
+        # Since we are returning a dense representation of
+        # SparseSeries, sparse_index might not align when calling
+        # a ufunc on the array. There doesn't seem to be a better way
+        # to do this, unfortunately.
+        if len(result) != self.sp_index.npoints:
+            sparse_index = None
+        else:
+            sparse_index = self.sp_index
+
         return self._constructor(result, index=self.index,
-                                 sparse_index=self.sp_index,
+                                 sparse_index=sparse_index,
                                  fill_value=fill_value,
                                  copy=False).__finalize__(self)
```

Review comment: This wasn't intuitive to me, but the __array*__ functions defined in numpy don't work quite as expected in certain circumstances. […] What would happen to the dense version of the array is that it would have 3 npoints when really it should only have 2. By zeroing out the sparse_index it fixes the problem, because it relies on the dense index instead, going around the problem.

Reply: these should simply operate on the sp_values, not on the densified values (or maybe it's possible that some classes of ufuncs could operate on the dense, but not sure).

Reply: so the expected behavior is […]
```diff
@@ -402,8 +413,8 @@ def abs(self):
         -------
         abs: type of caller
         """
-        return self._constructor(np.abs(self.values),
-                                 index=self.index).__finalize__(self)
+
+        return np.abs(self)

     def get(self, label, default=None):
         """
```
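For illustration, a hedged sketch of what the delegation above exercises (0.23-era API; the values and expected output are mine, not from the source): np.abs now flows through __array_wrap__, which drops the stale sp_index when the result length no longer matches its npoints.

```python
import numpy as np
import pandas as pd

# abs() now delegates to the ufunc machinery; __array_wrap__ re-wraps the
# dense result and only reuses sp_index when its npoints still match.
s = pd.Series([0.0, -1.0, 0.0, 2.0]).to_sparse(fill_value=0.0)
result = s.abs()            # equivalent to np.abs(s)
print(result.to_dense())    # expected: [0.0, 1.0, 0.0, 2.0]
```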
```diff
@@ -544,7 +555,7 @@ def to_dense(self, sparse_only=False):
             index = self.index.take(int_index.indices)
             return Series(self.sp_values, index=index, name=self.name)
         else:
-            return Series(self.values.to_dense(), index=self.index,
+            return Series(self.get_values(), index=self.index,
                           name=self.name)

     @property
```
Review comment: Clarify: merging sparse to sparse? Sparse and dense?