DOC: Categorical gotchas says memory usage is O(nm), when it is actually O(n+m) #17705

RauliRuohonen · 2017-09-28T11:15:16Z

Code Sample, a copy-pastable example if possible

import pandas as pd

def make_series(n_categories, repeats):
    # Build n_categories distinct string labels and repeat the whole list
    # `repeats` times before converting to a categorical.
    labels = [str(i) for i in range(n_categories)]
    return pd.Series(labels*repeats).astype('category')

def series_info(series):
    true_size = series.nbytes
    n = len(series.cat.categories)
    m = len(series.cat.codes)
    # 8 bytes per category (object pointer) plus 1 byte per int8 code.
    expected_size = n*8+m  # Note: not n*m*constant
    print('Number of categories (n): %d, length of data (m): %d, bytes: %d, '
          '8n+m: %d, nm: %d' % (n, m, true_size, expected_size, n*m))

series_info(make_series(123, 456))

# Output:
#  Number of categories (n): 123, length of data (m): 56088, bytes: 57072, 8n+m: 57072, nm: 6898824
#
# Note that 8n+m matches true size as it should. nm is much larger.

Problem description

The documentation says:

The memory usage of a Categorical is proportional to the number of categories times the length of the data.

If there are n categories and the data has length m, this implies that memory usage is O(nm). As the script above shows, however, the true usage is O(n+m): each category is stored once, plus one integer code per data element. The claimed O(nm) would suggest that the implementation stores the categories using one-hot encoding, which would be strange, to say the least.
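To make the O(n+m) versus O(nm) contrast concrete, here is a minimal sketch that splits a categorical's size into its two parts and compares it with an explicit one-hot encoding. The helper name `memory_breakdown` is hypothetical, and integer labels are an assumption made so that each category costs a fixed 8 bytes (int64). The one-hot frame is built with `pd.get_dummies` purely for comparison; it is not how pandas stores a categorical.

```python
import pandas as pd

def memory_breakdown(n_categories, repeats):
    """Byte counts for a categorical Series and for a one-hot encoding of it."""
    # Assumption: integer labels, so the category index is int64 (8 bytes each).
    values = list(range(n_categories)) * repeats
    series = pd.Series(values).astype("category")

    codes = series.cat.codes            # one small integer per data element
    categories = series.cat.categories  # each distinct category stored once

    # Explicit one-hot encoding: one uint8 cell per (row, category) pair.
    one_hot = pd.get_dummies(series, dtype="uint8")

    return {
        "codes_dtype": str(codes.dtype),
        "codes": codes.to_numpy().nbytes,
        "categories": categories.to_numpy().nbytes,
        "total": series.nbytes,
        "one_hot": one_hot.to_numpy().nbytes,
    }

info = memory_breakdown(123, 456)
print(info)
```

In this sketch `total` is the sum of `codes` and `categories`, while `one_hot` grows as n times m. Note also that the width of each code depends on the number of categories: with a larger category count pandas switches from int8 to a wider integer type, so the per-element cost is a small constant rather than always 1 byte.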

jreback · 2017-09-28T11:41:15Z

Well, if you want to submit a PR to correct the documentation, that would be fine.

berkay-dincer · 2017-10-02T06:38:34Z

@jreback @RauliRuohonen I created PR #17736 to solve this issue.

jreback added the Categorical (Categorical Data Type), Difficulty Novice, and Docs labels Sep 28, 2017

jreback added this to the Next Major Release milestone Sep 28, 2017

berkay-dincer mentioned this issue Oct 2, 2017

Fixed the memory usage explanation of categorical in gotchas from O(n… #17736

Merged

jreback modified the milestones: Next Major Release, 0.21.0 Oct 2, 2017

jreback closed this as completed in #17736 Oct 2, 2017
