merge of sparse dataframe fails #13665

simonm3 · 2016-07-15T14:43:01Z

This works fine:

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.merge(df2, how="left", on="A")

This fails with "TypeError: type object argument after * must be a sequence, not map"

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
df.merge(df2, how="left", on="A")


sinhrks · 2016-07-16T18:50:09Z

Thanks for the report. Sparse data doesn't currently support all DataFrame functionality, and unsupported operations should raise an understandable error.

I feel that supporting merge on sparse data would have to come with some limitations. Can you describe what kind of data you actually have (please show some sample data)?

simonm3 · 2016-07-16T18:58:36Z

Thanks for the reply. It would be better to have a sensible error, and a list
in the docs of what is and is not supported.

Even the small example I gave does not work. My real example has 32m rows
merged with 1m rows. It fails with a memory error if not sparse. Is there a
way of merging two large dataframes without using any extra memory, e.g.
inplace=True for merge? I tried copy=False but it did not help. My
workaround is to merge a few columns at a time, which allows me to release
memory from the original once copied.
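
The column-batching workaround described above might look roughly like the sketch below. `merge_in_column_batches` is a hypothetical helper, not a pandas API, and it assumes the key is unique in the right-hand frame and that the frames share no column names other than the key. Each step still builds a new intermediate result, so it illustrates the idea rather than guaranteeing a lower peak memory.

```python
import pandas as pd


def merge_in_column_batches(left, right, key, batch_size=50):
    """Left-merge `right` into `left` a few columns at a time.

    Hypothetical helper, not a pandas API. Assumes `key` is unique in
    `right` and that the two frames share no column names besides `key`.
    """
    value_cols = [c for c in right.columns if c != key]
    result = left
    for start in range(0, len(value_cols), batch_size):
        batch = value_cols[start:start + batch_size]
        # validate="many_to_one" raises MergeError if `key` repeats in `right`.
        result = result.merge(
            right[[key] + batch], how="left", on=key, validate="many_to_one"
        )
        # Rebind the local name to a frame without the merged columns. Memory
        # is only reclaimed if the caller holds no other reference to them.
        right = right.drop(columns=batch)
    return result


left = pd.DataFrame({"A": [1, 2, 3, 2], "B": [0.5, 1.5, 2.5, 3.5]})
right = pd.DataFrame(
    {"A": [1, 2, 3], "C": [10.0, 20.0, 30.0], "D": [1.0, 2.0, 3.0], "E": [7.0, 8.0, 9.0]}
)
batched = merge_in_column_batches(left, right, key="A", batch_size=2)
```

On this toy input the batched result matches a single `left.merge(right, how="left", on="A")`.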

sinhrks · 2016-07-16T19:06:41Z

Is there a way of merging two large dataframes without using any extra memory

It may be worth trying https://github.com/dask/dask.

Please let me know more about your data, to inform future merge support.

Are the merge keys unique?
Do they match one to one (would an inner join meet your needs)?

simonm3 · 2016-07-16T19:11:12Z

The key is unique, and in this case it was an inner join. The first table has
just 2 columns; the second table has 500 columns.

I will try dask; it looks interesting.
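
For the shape described here (a unique key, and one table with only a key and a single value column), a lookup with `Series.map` is one possible alternative to `merge`. This is only a sketch on made-up data: the names `narrow`, `wide`, `key`, and `label` are illustrative and nothing in the thread confirms that it fits the real dataset. It adds the one extra column to the wide frame in place of rebuilding all of its columns.

```python
import pandas as pd

# Toy stand-ins: `narrow` plays the 2-column table, `wide` the many-column one.
narrow = pd.DataFrame({"key": [1, 2, 3], "label": ["a", "b", "c"]})
wide = pd.DataFrame({"key": [3, 1, 4], "x": [0.1, 0.2, 0.3], "y": [1.0, 2.0, 3.0]})

# Series indexed by the (unique) key; map() looks each key up in this index.
lookup = narrow.set_index("key")["label"]
wide["label"] = wide["key"].map(lookup)  # unmatched keys become missing values

# Inner-join semantics: keep only rows whose key exists in the narrow table.
# Note that this row filter creates a new frame.
matched = wide[wide["key"].isin(lookup.index)]
```

The final filter still copies the matched rows, so whether this saves memory in practice depends on how many rows match.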

hexgnu · 2018-01-02T17:09:36Z

So I took a look at this. The problem is that, inside the merge implementation, the sparse blocks get cast to dense blocks when get_values() is invoked.

The simple fix would be to convert the whole thing to a dense DataFrame, although that seems confusing. So I will dig in to see whether I can use sparse blocks instead of the dense blocks.
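
As a user-side illustration of the "convert to dense first" idea, here is a minimal sketch. It assumes a pandas version that has `pd.SparseDtype` and the `DataFrame.sparse` accessor, and it uses those rather than `to_sparse()`, since `SparseDataFrame` was later removed (#28425). It is not the change proposed for pandas internals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dense1 = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
dense2 = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# Store every column as a sparse array with 0 as the fill value.
sparse1 = dense1.astype(pd.SparseDtype("int64", 0))
sparse2 = dense2.astype(pd.SparseDtype("int64", 0))

# Densify both frames first, then merge ordinary dense frames.
merged = sparse1.sparse.to_dense().merge(
    sparse2.sparse.to_dense(), how="left", on="A"
)
```

Densifying gives up the memory savings of the sparse representation, which is the drawback the comment above alludes to.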

matanox · 2018-09-02T16:07:27Z

While this PR has not been merged yet, it is worth noting that merging currently also breaks when a sparse dataframe is merged with a non-sparse one (the resulting dataframe may not be usable after the merge).

TomAugspurger · 2018-10-17T13:49:08Z

This should be fixed on master.

In [4]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
   ...: df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
   ...: df.merge(df2, how="left", on="A").head()
Out[4]:
    A  B_x  C_x  D_x   B_y   C_y   D_y
0  93   45   47   17  55.0  14.0  70.0
1  93   45   47   17  11.0  59.0  79.0
2  93   45   47   17  36.0  97.0  61.0
3  93   45   47   17  42.0  14.0  24.0
4  83   68   60   85   NaN   NaN   NaN

A confirmatory test and a release note would be appreciated, if anyone wants to do that.

sinhrks added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Sparse (Sparse Data Type), and Error Reporting (Incorrect or improved errors from pandas) labels on Jul 16, 2016

hexgnu mentioned this issue on Feb 1, 2018 in "Allows for merging of SparseDataFrames, and fixes __array__ interface" #19488 (closed)

TomAugspurger added the Effort Low and good first issue labels on Oct 17, 2018

TomAugspurger mentioned this issue on Sep 16, 2019 in "Remove SparseSeries and SparseDataFrame" #28425 (merged)

TomAugspurger closed this as completed in #28425 on Sep 18, 2019
