ENH: Allow get_dummies to return SparseDataFrame #8823

Closed
wants to merge 1 commit into from

artemyk commented Nov 15, 2014

For DataFrames with a large number of unique values, get_dummies can use enormous amounts of memory. This PR adds a sparse flag to get_dummies; when it is set to True, the result is a SparseDataFrame, which is much more memory-efficient.

For example:

import pandas as pd
df1 = pd.get_dummies(pd.DataFrame({'a':range(1000)}, dtype='category').a, sparse=False)
print 'dense:', df1.memory_usage().sum()
df2 = pd.get_dummies(pd.DataFrame({'a':range(1000)}, dtype='category').a, sparse=True)
print 'sparse:', df2.memory_usage().sum()
pd.util.testing.assert_frame_equal(df1, df2)

prints

dense: 8000000
sparse: 8000
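
For readers on a current pandas release, where SparseDataFrame no longer exists, the same comparison can be sketched roughly as follows. This assumes a pandas version in which get_dummies(..., sparse=True) returns a regular DataFrame backed by sparse columns; the variable names are illustrative and the byte counts will differ from the figures above.

```python
import pandas as pd

# 1000 distinct categories -> 1000 indicator columns, each with a single nonzero entry
values = pd.Series(range(1000), dtype="category")

dense = pd.get_dummies(values, sparse=False)
sparse = pd.get_dummies(values, sparse=True)

# Compare column storage only (index excluded)
dense_bytes = int(dense.memory_usage(index=False).sum())
sparse_bytes = int(sparse.memory_usage(index=False).sum())
print("dense:", dense_bytes)
print("sparse:", sparse_bytes)

# Assumption: every column of the sparse result is sparse, so the
# DataFrame.sparse accessor can densify it for a value-by-value comparison.
same_values = bool((sparse.sparse.to_dense().to_numpy() == dense.to_numpy()).all())
print("same values:", same_values)
```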

Performance could probably be improved a lot.

Currently fails pandas/tests/test_reshape.py:TestGetDummiesSparse.test_include_na due to #8822.

@artemyk force-pushed the sparse_get_dummies branch 2 times, most recently from 81bd50e to ebafb72 Compare November 15, 2014 03:15
@artemyk changed the title from "Sparse get dummies" to "ENH: Allow get_dummies to return SparseDataFrame" Nov 15, 2014
@jreback added the Sparse (Sparse Data Type) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Nov 15, 2014

jreback commented Mar 25, 2015

@artemyk can you rebase this? You will also need to compare sparse vs. non-sparse results in the tests.

@artemyk force-pushed the sparse_get_dummies branch 2 times, most recently from f04a698 to c54bb97 Compare March 26, 2015 05:12

artemyk commented Mar 26, 2015

@jreback Rebased on top of the #8822 branch.

@artemyk force-pushed the sparse_get_dummies branch 2 times, most recently from 0ad8e08 to ba28f40 Compare April 8, 2015 22:08

artemyk commented Apr 8, 2015

@jreback This should be ready once #8822 is merged.
Regarding tests that compare sparse vs. non-sparse: there is already a battery of tests that compare get_dummies results against 'expected' versions. Since both the sparse and non-sparse versions of get_dummies are compared to the same 'expected' DataFrames, doesn't this effectively test that they return the same thing?
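
The testing pattern described here could look roughly like the sketch below. This is not the PR's actual test code: check_get_dummies_basic is a hypothetical helper, the input data is made up, and it assumes a recent pandas in which the sparse result can be densified through the DataFrame.sparse accessor.

```python
import pandas as pd


def check_get_dummies_basic(sparse):
    # Hypothetical helper: run one shared expectation against either code path.
    s = pd.Series(list("abca"))
    expected = pd.DataFrame(
        {"a": [1, 0, 0, 1], "b": [0, 1, 0, 0], "c": [0, 0, 1, 0]},
        dtype="uint8",
    )
    result = pd.get_dummies(s, sparse=sparse, dtype="uint8")
    if sparse:
        # Densify so the same 'expected' frame serves both variants
        result = result.sparse.to_dense()
    pd.testing.assert_frame_equal(result, expected)


# Both variants are checked against the same expected frame, so they are
# indirectly checked against each other.
for sparse in (False, True):
    check_get_dummies_basic(sparse)
```

Densifying before the comparison checks values only; a test that also cares about the return type would need a separate assertion on the sparse result itself.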

@artemyk force-pushed the sparse_get_dummies branch 2 times, most recently from 7496d5d to 0c556df Compare April 12, 2015 01:57
ENH: Allow get_dummies to return sparse dataframe

Fix

Fix

Fixes

Bug in order of columns

Slight speed improvement

get_dummies update

Release notes update

Remove convert dummies test
@artemyk force-pushed the sparse_get_dummies branch from 0c556df to 7173395 Compare April 12, 2015 16:50

artemyk commented Apr 12, 2015

@jreback Ready to merge?

@@ -48,6 +48,7 @@ Enhancements
df.drop(['A', 'X'], axis=1, errors='ignore')

- Allow conversion of values with dtype ``datetime64`` or ``timedelta64`` to strings using ``astype(str)`` (:issue:`9757`)
- ``get_dummies`` function now accepts ``sparse`` keyword. If set to ``True``, the return DataFrame is sparse. (:issue:`8823`)

put backticks around DataFrame.
say "the return DataFrame is sparse (e.g. SparseDataFrame)"

jreback commented Apr 13, 2015

merged via 4673225

thanks!
