merge of sparse dataframe fails #13665

Closed
simonm3 opened this issue Jul 15, 2016 · 7 comments · Fixed by #28425
Labels
Error Reporting Incorrect or improved errors from pandas good first issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type

Comments

@simonm3

simonm3 commented Jul 15, 2016

This works fine:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.merge(df2, how="left", on="A")

This fails with "TypeError: type object argument after * must be a sequence, not map"

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
df.merge(df2, how="left", on="A")
@sinhrks
Member

sinhrks commented Jul 16, 2016

Thanks for the report. Sparse data structures currently don't support all functionality, and unsupported operations should raise an understandable error.

I feel that supporting merge on sparse data will have to come with some limitations. Can you describe what type of data you actually have (please show some sample data)?

@sinhrks sinhrks added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type Error Reporting Incorrect or improved errors from pandas labels Jul 16, 2016
@simonm3
Author

simonm3 commented Jul 16, 2016

Thanks for the reply. It would be better to have a sensible error and a list in
the docs of what is and is not supported.

Even the small example I gave does not work. My real example has 32m rows
merged with 1m rows. It fails with a memory error if not sparse. Is there a
way of merging two large dataframes without using any extra memory, e.g.
inplace=True for merge? I tried copy=False but it did not help. My
workaround is to merge a few columns at a time, which allows me to release
memory from the original once copied.
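
The column-batch workaround described above could look roughly like the sketch below. The helper name `merge_in_column_batches`, the `batch_size` parameter, and the toy data are assumptions made for illustration; none of them are pandas APIs or taken from the original report. The sketch also assumes the key is unique in both frames, and whether it actually lowers peak memory depends on the data.

```python
import numpy as np
import pandas as pd


def merge_in_column_batches(left, right, key, batch_size=100):
    """Inner-join `right` onto `left` a few columns at a time.

    Hypothetical helper (not part of pandas). Assumes `key` is unique in
    both frames. Note that `right` is modified in place: each batch of
    columns is dropped from it once it has been merged.
    """
    result = left
    value_cols = [c for c in right.columns if c != key]
    for start in range(0, len(value_cols), batch_size):
        batch = value_cols[start:start + batch_size]
        # Merge only the key plus the current batch of columns.
        result = result.merge(right[[key] + batch], how="inner", on=key)
        # Drop the columns that were just copied so their memory can be released.
        right.drop(columns=batch, inplace=True)
    return result


# Toy data standing in for the real tables (sizes are illustrative only).
rng = np.random.default_rng(0)
left = pd.DataFrame({"key": np.arange(1000), "value": rng.integers(0, 100, 1000)})
right = pd.DataFrame(rng.integers(0, 100, size=(1000, 20)),
                     columns=[f"col{i}" for i in range(20)])
right.insert(0, "key", np.arange(1000))

merged = merge_in_column_batches(left, right, key="key", batch_size=5)
print(merged.shape)
```

Because the helper drops columns from `right` as it goes, pass a copy if the original frame is still needed afterwards.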


@sinhrks
Member

sinhrks commented Jul 16, 2016

Is there a way of merging two large dataframes without using any extra memory

It may be worth trying https://github.com/dask/dask.

Please let me know more about your data for future merging support.

  • Are the merging keys unique?
  • Do they match one to one (does an inner join meet your needs)?
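
Both questions can be checked with plain (dense) pandas before deciding how to merge. The snippet below is only a sketch on made-up frames: `Series.is_unique` answers the first question, the `validate="one_to_one"` argument of `merge` raises a `MergeError` if either side has duplicate keys, and `indicator=True` on an outer join shows whether every key has a partner.

```python
import numpy as np
import pandas as pd

# Small illustrative frames; the real data is not shown in this thread.
left = pd.DataFrame({"A": np.arange(5), "B": np.arange(5) * 10})
right = pd.DataFrame({"A": np.arange(5), "C": np.arange(5) * 100})

# Are the merging keys unique on both sides?
keys_unique = left["A"].is_unique and right["A"].is_unique

# validate="one_to_one" makes merge raise if keys are duplicated on either side.
merged = left.merge(right, how="inner", on="A", validate="one_to_one")

# An outer join with indicator=True reveals keys present on only one side.
check = left.merge(right, how="outer", on="A", indicator=True)
all_matched = bool((check["_merge"] == "both").all())

print(keys_unique, all_matched, len(merged))
```

If `all_matched` is true and the keys are unique, an inner join returns the same rows as a left or outer join.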

@simonm3
Author

simonm3 commented Jul 16, 2016

The key is unique. In this case it was an inner join. The first table has just
2 columns; the second table has 500 columns.

Will try dask. Looks interesting.


@hexgnu
Contributor

hexgnu commented Jan 2, 2018

So I took a look at this. The problem is that, inside the merge implementation, the sparse blocks get cast to dense blocks when get_values() is invoked.

The simple fix would be to convert the whole thing to a dense data frame, although that seems confusing. So I will dig to see if I can use sparse blocks instead of the dense blocks.
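
For reference, the "convert to dense first" approach can be done by hand from user code. The sketch below assumes pandas 1.0 or later, where `to_sparse()` no longer exists and sparse columns are created with `astype(pd.SparseDtype(...))` and densified through the `.sparse` accessor; it is not the internal fix discussed here, and the fill value of 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sparse_dtype = pd.SparseDtype("int64", fill_value=0)  # fill value chosen for illustration

df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)),
                  columns=list("ABCD")).astype(sparse_dtype)
df2 = pd.DataFrame(rng.integers(0, 100, size=(100, 4)),
                   columns=list("ABCD")).astype(sparse_dtype)

# Densify both frames explicitly, then merge the dense copies.
dense_merged = df.sparse.to_dense().merge(df2.sparse.to_dense(), how="left", on="A")
print(dense_merged.head())
```

This gives up the memory savings of the sparse representation for the duration of the merge, which is why it does not help with the memory error described earlier in the thread.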

@matanox

matanox commented Sep 2, 2018

Although this PR has not been merged yet, it is worth noting that merging currently also breaks when merging a sparse dataframe with a non-sparse one (the resulting dataframe may not be usable after the merge).

@TomAugspurger
Contributor

TomAugspurger commented Oct 17, 2018

This should be fixed on master.

In [4]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
   ...: df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
   ...: df.merge(df2, how="left", on="A").head()
Out[4]:
    A  B_x  C_x  D_x   B_y   C_y   D_y
0  93   45   47   17  55.0  14.0  70.0
1  93   45   47   17  11.0  59.0  79.0
2  93   45   47   17  36.0  97.0  61.0
3  93   45   47   17  42.0  14.0  24.0
4  83   68   60   85   NaN   NaN   NaN

A confirmatory test and a release note would be appreciated, if anyone wants to do that.
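
A confirmatory test might look something like the sketch below. It is an assumption about what such a test could check, not the test that was eventually added to pandas: it merges two frames with sparse columns and compares the result, column by column, with the merge of their dense counterparts. It uses `astype(pd.SparseDtype(...))` because `to_sparse()` was removed in pandas 1.0; the function name and the fill value are illustrative.

```python
import numpy as np
import pandas as pd


def test_merge_sparse_matches_dense():
    # Hypothetical test name; the data mirrors the example from the report.
    rng = np.random.default_rng(0)
    dense_left = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
    dense_right = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

    sparse_dtype = pd.SparseDtype("int64", fill_value=0)
    sparse_left = dense_left.astype(sparse_dtype)
    sparse_right = dense_right.astype(sparse_dtype)

    result = sparse_left.merge(sparse_right, how="left", on="A")
    expected = dense_left.merge(dense_right, how="left", on="A")

    assert list(result.columns) == list(expected.columns)
    assert len(result) == len(expected)
    for col in expected.columns:
        # Compare as dense float arrays so that NaN from unmatched rows lines up.
        np.testing.assert_allclose(
            np.asarray(result[col], dtype="float64"),
            np.asarray(expected[col], dtype="float64"),
            equal_nan=True,
        )


test_merge_sparse_matches_dense()
```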

5 participants