Code Sample, a copy-pastable example if possible

import pandas as pd


def make_series(categories, repeats):
    categories = [str(i) for i in range(categories)]
    return pd.Series(categories * repeats).astype('category')


def series_info(series):
    true_size = series.nbytes
    n = len(series.cat.categories)
    m = len(series.cat.codes)
    expected_size = n * 8 + m  # Note: not n*m*constant
    print('Number of categories (n): %d, length of data (m): %d, bytes: %d, '
          '8n+m: %d, nm: %d' % (n, m, true_size, expected_size, n * m))


series_info(make_series(123, 456))

# Output:
# Number of categories (n): 123, length of data (m): 56088, bytes: 57072, 8n+m: 57072, nm: 6898824
#
# Note that 8n+m matches true size as it should. nm is much larger.
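The 8n+m figure in the script above depends on 123 categories fitting in one-byte codes. As a rough sketch of how the two terms separate, the helper below (`size_breakdown` is a hypothetical name, not part of pandas) reports the bytes used by the codes and by the categories. It assumes integer labels so that each category takes a fixed 8 bytes, which differs from the string labels used above.

```python
import numpy as np
import pandas as pd


def size_breakdown(n_categories, repeats):
    """Return (codes dtype, codes bytes, categories bytes, total bytes)."""
    # Assumption: int64 labels, so category storage is 8 bytes per category.
    values = np.tile(np.arange(n_categories, dtype=np.int64), repeats)
    cat = pd.Categorical(values)
    codes_bytes = cat.codes.nbytes            # one integer code per element
    categories_bytes = cat.categories.nbytes  # each category stored once
    return cat.codes.dtype, codes_bytes, categories_bytes, cat.nbytes


for n in (100, 1_000):
    dtype, codes_bytes, categories_bytes, total = size_breakdown(n, repeats=3)
    print('n=%d, codes dtype=%s, codes bytes=%d, categories bytes=%d, total=%d'
          % (n, dtype, codes_bytes, categories_bytes, total))
```

With more categories the codes need a wider integer type, so the per-element cost grows in steps, but the total should remain a sum of a per-category term and a per-element term rather than a product.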
Problem description
The documentation says:
"The memory usage of a Categorical is proportional to the number of categories times the length of the data."
If there are n categories and the length of the data is m, this means that memory usage is O(nm). As the script above shows, however, the true usage is O(n+m): each category is stored once (8 bytes each in the script) and each element is stored as a small integer code (1 byte each in the script). The claimed usage O(nm) suggests that the implementation would store the categories using one-hot encoding, which would be strange to say the least.
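For comparison, here is a minimal sketch of what an O(nm) layout would look like. The one-hot matrix is built by hand with NumPy purely for illustration; it is a hypothetical layout and is not how pandas stores a Categorical. The sizes (123 categories, 10 repeats) are arbitrary and chosen to keep the example small.

```python
import numpy as np
import pandas as pd

n, repeats = 123, 10
labels = [str(i) for i in range(n)] * repeats
series = pd.Series(labels).astype('category')

# Integer codes as stored by the Categorical: one small integer per element.
codes = series.cat.codes.to_numpy()
m = len(codes)

# Hypothetical one-hot layout: one int8 column per category, one row per element.
one_hot = np.zeros((m, n), dtype=np.int8)
one_hot[np.arange(m), codes] = 1

print('codes bytes (m): %d, one-hot bytes (nm): %d' % (codes.nbytes, one_hot.nbytes))
```

The codes array grows with m alone, while the one-hot matrix grows with n times m, which is the behaviour the documentation sentence would imply.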