MultiIndex takes up a huge amount of storage space #5247

Closed
anmyachev opened this issue Nov 21, 2022 · 3 comments · Fixed by #5632
Labels
Memory 💾 (Issues related to memory) · P1 (Important tasks that we should complete soon) · question ❓ (Questions about Modin) · Ray ⚡ (Issues related to the Ray engine)

Comments

@anmyachev
Collaborator

The situation worsens when the workflow code contains a large number of single-column inserts, resulting in a large number of column partitions, each of which stores its own copy of the MultiIndex.
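
For illustration, a minimal sketch of the insert-heavy pattern described above (the column names and sizes are made up for the example):

import numpy as np
import modin.pandas as pd

# A row MultiIndex in the same spirit as the reproducer below.
index = pd.MultiIndex.from_arrays(
    [[f"key_{i}" for i in range(1000)], np.arange(1000)]
)
df = pd.DataFrame({"a": np.ones(1000)}, index=index)

# Each single-column insert can end up in its own column partition,
# and every column partition keeps its own copy of the row MultiIndex,
# so the (heavy) index data gets duplicated many times.
for i in range(20):
    df[f"col_{i}"] = np.arange(len(df))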

Possible solutions to this problem:

  • reduce the number of partitions using environment variables (see the sketch after this list)
  • when creating a new partition, store as many columns in it as possible (on the other hand, the number of column partitions must be greater than one to parallelize computation); this requires modifying operations such as setitem/insert and so on
  • store internal dataframes with range indexes (essentially placeholders) and set a valid index on them only for the duration of an operation; that valid index would have to be stored somewhere
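
A minimal sketch of the first option (assuming Modin's modin.config module; the partition count below is just an illustration):

import modin.config as cfg

# Cap the number of partitions before any DataFrame is created, so fewer
# column partitions (and hence fewer copies of the MultiIndex) exist.
# The MODIN_NPARTITIONS environment variable can be used instead of
# calling NPartitions.put().
cfg.NPartitions.put(4)

import modin.pandas as pd
# ... build DataFrames with the MultiIndex as usual ...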

Code to reproduce:

import modin.pandas as pd
import numpy as np
import ray

np.random.seed(42)

count_rows = 10**6
nonunique_ratio = 0.7

base_strings = ["long_string_dataaaaaaaaaaaaaaaa{}", "cat{}"]
arrays = [None, None]
for idx in range(len(arrays)):
    nonunique_count = int(nonunique_ratio * count_rows)
    data = [base_strings[idx].format(x) for x in range(nonunique_count // 2)]
    nonunique_data = np.append(np.array(data), np.array(data))
    unique_data = np.array(
        [base_strings[idx].format(x) for x in range(nonunique_count, count_rows)]
    )
    arrays[idx] = np.append(nonunique_data, unique_data)
    np.random.shuffle(arrays[idx])

index = pd.MultiIndex.from_arrays(arrays)
print(index)

# initialize ray as Modin does
_ = pd.DataFrame([1, 2])
# `ray memory` before all explicit put operations
# --- Aggregate object store stats across all nodes ---
# Plasma memory usage 0 MiB, 4 objects,

# put in storage by part
count_cpus = 100
one_part = count_rows // count_cpus
refs = [None] * count_cpus
for idx in range(count_cpus):
    refs[idx] = ray.put(
        index[idx * one_part : (idx + 1) * one_part]
    )  # it takes ~ 3210 MiB in storage

# `ray memory` output after several ray.put operations
# --- Aggregate object store stats across all nodes ---
# Plasma memory usage 3210 MiB, 104 objects,

# put in storage at once
ref = ray.put(index)  # it takes ~ 40 MiB in storage
# `ray memory` output after last put operation
# --- Aggregate object store stats across all nodes ---
# Plasma memory usage 3250 MiB, 105 objects,
anmyachev added the question ❓, Memory 💾, Ray ⚡ and P1 labels on Nov 21, 2022
@jbrockmendel
Collaborator

store internal dataframes with range indexes

+1. Ideally, in pandas we'll refactor the Manager classes to not hold the axes at all (xref pandas #48126), and subsequently Modin partitions can hold Managers instead of DataFrames.

dchigarev added a commit that referenced this issue Feb 7, 2023
@dchigarev
Collaborator

As a temporary solution, we could try using MultiIndex.remove_unused_levels() when chunking a MultiIndex. It removes unused values from the encoding table, so each partition would store only the part of the table that belongs to it.
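
For context, a small standalone illustration of what remove_unused_levels() does to the encoding table of a sliced MultiIndex (plain pandas, toy data):

import pandas as pd

mi = pd.MultiIndex.from_arrays([["a", "b", "c", "d"], [1, 2, 3, 4]])

chunk = mi[:2]
print(chunk.levels[0])    # Index(['a', 'b', 'c', 'd'], ...) -- the full table is kept
trimmed = chunk.remove_unused_levels()
print(trimmed.levels[0])  # Index(['a', 'b'], ...) -- only the labels this chunk uses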

I've made the following changes to the script above and measured ray memory on the same checkpoints.

for idx in range(count_cpus):
    refs[idx] = ray.put(
-        index[idx * one_part : (idx + 1) * one_part]
+        index[idx * one_part : (idx + 1) * one_part].remove_unused_levels()
    )  # it takes ~ 3210 MiB in storage

checkpoint № | w/o dropping unused levels     | w/ dropping unused levels
------------ | ------------------------------ | -----------------------------
№1           | 0 MiB, 4 objs, 0% full         | 0 MiB, 4 objs, 0% full
№2           | 3210 MiB, 104 objs, 1.68% full | 52 MiB, 104 objs, 0.03% full
№3           | 3250 MiB, 105 objs, 1.7% full  | 92 MiB, 105 objs, 0.05% full

P.S. It also works noticeably faster, since much lighter objects are being put into the plasma store.

I'm wondering what the disadvantages of using .remove_unused_levels() are then. I mean, why isn't this done automatically by pandas? @jbrockmendel, maybe you know?

@jbrockmendel
Collaborator

I'm wondering what the disadvantages of using .remove_unused_levels() are then. I mean, why isn't this done automatically by pandas?

The real reason is most likely that it simply hasn't been suggested. The only downside that comes to mind: if you split a MultiIndex, then remove the unused levels, and then want to compare/concat/setop the results, it is more efficient to have known matching levels.
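
For illustration, the known-matching-levels point with toy data (plain pandas; how much this matters in practice depends on the workload):

import pandas as pd

mi = pd.MultiIndex.from_arrays([["a", "b", "c", "d"], [1, 2, 3, 4]])
left, right = mi[:2], mi[2:]

# Slices share the original encoding table, so their levels match exactly
# and comparisons/concat/set operations don't need to rebuild a common
# table first (per the comment above).
print(left.levels[0].equals(right.levels[0]))      # True

# After trimming, the tables diverge and have to be re-matched by value.
left_t, right_t = left.remove_unused_levels(), right.remove_unused_levels()
print(left_t.levels[0].equals(right_t.levels[0]))  # False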

YarShev pushed a commit that referenced this issue Mar 1, 2023