merge of sparse dataframe fails #13665
Comments
Thanks for the report. Sparse structures currently don't support all functionality, and where they don't they should raise an understandable error. I feel that supporting merge on sparse data would need some limitations. Can you describe what type of data you actually have (please show some sample data)?
Thanks for the reply. It would be better to have a sensible error and list in the […] Even the small example I gave does not work. My real example has 32m rows.
On 16 July 2016 at 19:50, Sinhrks notifications@github.com wrote:
It may be worth trying https://github.com/dask/dask. Please let me know more about your data, for future merge support.
unique key. In this case it was an inner join. first table just 2 columns. second will try dask. Looks interesting.
On 16 July 2016 at 20:06, Sinhrks notifications@github.com wrote:
So I took a look at this. The problem is that, inside the merge implementation, the sparse blocks get cast to dense blocks while invoking […]. The simple fix would be to convert the whole thing to a dense data frame, although that seems confusing. So I will dig in to see if I can use sparse blocks instead of the dense blocks.
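For readers who just need a result, the "simple fix" mentioned above (densify first, then merge) can be sketched roughly as follows. This is only an illustration, not the change proposed for pandas itself: `merge_via_dense` is a hypothetical helper name, and the code assumes a newer pandas that provides `pd.SparseDtype` and the `.sparse` accessor rather than the `to_sparse()` API used elsewhere in this thread. It also gives up the memory savings of sparse storage during the merge.

```python
import pandas as pd


def merge_via_dense(left, right, **merge_kwargs):
    """Hypothetical helper: densify sparse columns, then do a normal merge."""

    def densify(df):
        out = df.copy()
        for col in out.columns:
            # Only touch columns that are actually stored as sparse.
            if isinstance(out[col].dtype, pd.SparseDtype):
                out[col] = out[col].sparse.to_dense()
        return out

    return densify(left).merge(densify(right), **merge_kwargs)


# Small made-up frames: dense key column, sparse value columns (fill value 0.0).
left = pd.DataFrame({"key": [1, 2, 3], "a": [0.0, 0.0, 5.0]})
left["a"] = left["a"].astype(pd.SparseDtype("float64", 0.0))

right = pd.DataFrame({"key": [2, 3, 4], "b": [7.0, 0.0, 0.0]})
right["b"] = right["b"].astype(pd.SparseDtype("float64", 0.0))

merged = merge_via_dense(left, right, how="inner", on="key")
print(merged)
```

The merged frame is fully dense; individual columns can be converted back with `astype(pd.SparseDtype(...))` afterwards if memory matters.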
While this PR has not been merged yet, it is worth noting that merge currently also breaks when a sparse dataframe is merged with a non-sparse one (the resulting dataframe may not be usable after the merge).
This should be fixed on master.
In [4]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
   ...: df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')).to_sparse()
   ...: df.merge(df2, how="left", on="A").head()
Out[4]:
    A  B_x  C_x  D_x   B_y   C_y   D_y
0  93   45   47   17  55.0  14.0  70.0
1  93   45   47   17  11.0  59.0  79.0
2  93   45   47   17  36.0  97.0  61.0
3  93   45   47   17  42.0  14.0  24.0
4  83   68   60   85   NaN   NaN   NaN
A confirmatory test and a release note would be appreciated, if anyone wants to do that.
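A confirmatory test along those lines might look something like the sketch below. It is not the test that ended up in pandas: the function and helper names are made up, the data is a small hand-written example, and it is written against the newer `pd.SparseDtype` API on the assumption that `to_sparse()` is unavailable (it was removed in later pandas releases). The key column is kept dense and only the value columns are sparse. The idea is simply that a left merge of the sparse frames should give the same values as the same merge on the dense frames.

```python
import numpy as np
import pandas as pd


def to_sparse_values(df, key):
    """Hypothetical helper: store every non-key column as sparse (NaN fill)."""
    value_cols = [c for c in df.columns if c != key]
    return df.astype({c: pd.SparseDtype("float64", np.nan) for c in value_cols})


def check_sparse_merge_matches_dense():
    left = pd.DataFrame({"A": [1, 2, 2, 3], "B": [np.nan, 1.0, np.nan, 2.0]})
    right = pd.DataFrame({"A": [2, 2, 4], "C": [np.nan, 3.0, 4.0]})

    # Reference result from the ordinary dense merge.
    expected = left.merge(right, how="left", on="A")

    # Same merge, but with sparse value columns on both sides.
    result = to_sparse_values(left, "A").merge(
        to_sparse_values(right, "A"), how="left", on="A"
    )

    assert list(result.columns) == list(expected.columns)
    for col in expected.columns:
        # Compare as plain float arrays; NaNs in matching positions count as equal.
        np.testing.assert_array_equal(
            np.asarray(result[col], dtype="float64"),
            expected[col].to_numpy(dtype="float64"),
        )
    return result


result = check_sparse_merge_matches_dense()
print(result)
```

Comparing against the dense merge avoids hard-coding expected values, so the check stays valid if the input frames are swapped for larger or random ones.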
This works fine:
This fails with "TypeError: type object argument after * must be a sequence, not map"