Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Categorical gotchas says memory usage is O(nm), when it is actually O(n+m) #17705

Closed
RauliRuohonen opened this issue Sep 28, 2017 · 2 comments · Fixed by #17736
Closed
Labels
Categorical Categorical Data Type Docs
Milestone

Comments

@RauliRuohonen
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd

def make_series(categories, repeats):
    categories = [str(i) for i in range(categories)]
    return pd.Series(categories*repeats).astype('category')

def series_info(series):
    true_size = series.nbytes
    n = len(series.cat.categories)
    m = len(series.cat.codes)
    expected_size = n*8+m  # Note: not n*m*constant
    print('Number of categories (n): %d, length of data (m): %d, bytes: %d, '
          '8n+m: %d, nm: %d' % (n, m, true_size, expected_size, n*m))

series_info(make_series(123, 456))

# Output:
#  Number of categories (n): 123, length of data (m): 56088, bytes: 57072, 8n+m: 57072, nm: 6898824
#
# Note that 8n+m matches true size as it should. nm is much larger.

Problem description

The documentation says:

The memory usage of a Categorical is proportional to the number of categories times the length of the data.

If there are n categories and the length of data is m, this means that memory usage is O(nm). As the script above shows however, the true usage is O(n+m). The claimed usage O(nm) suggests that the implementation would store the categories using one-hot encoding, which would be strange to say the least.

@jreback
Copy link
Contributor

jreback commented Sep 28, 2017

well if you want to issue a PR to correct the documentation that would be fine.

@berkay-dincer
Copy link
Contributor

@jreback @RauliRuohonen created PR #17736 to solve this issue.

@jreback jreback modified the milestones: Next Major Release, 0.21.0 Oct 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Docs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants