MultiIndex takes up a huge amount of storage space #5247

Closed
anmyachev opened this issue Nov 21, 2022 · 3 comments · Fixed by #5632
Labels
Memory 💾 (Issues related to memory) · P1 (Important tasks that we should complete soon) · question ❓ (Questions about Modin) · Ray ⚡ (Issues related to the Ray engine)

Comments

@anmyachev
Collaborator

The situation worsens when the workflow code contains a large number of single-column inserts, resulting in a large number of column partitions, each of which stores its own copy of the MultiIndex.
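
For illustration, a minimal sketch of the insert-heavy pattern described above (the column names and sizes are made up for the example):

import numpy as np
import modin.pandas as pd

# A row MultiIndex in the same spirit as the reproducer below.
index = pd.MultiIndex.from_arrays(
    [[f"key_{i}" for i in range(1000)], np.arange(1000)]
)
df = pd.DataFrame({"a": np.ones(1000)}, index=index)

# Each single-column insert can end up in its own column partition,
# and every column partition keeps its own copy of the row MultiIndex,
# so the (heavy) index data gets duplicated many times.
for i in range(20):
    df[f"col_{i}"] = np.arange(len(df))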

Possible solutions to this problem:

  • reduce the number of partitions using environment variables (see the sketch after this list)
  • when creating a new partition, store as many columns in it as possible (on the other hand, the number of column partitions must be greater than one to parallelize computation); this requires modifying operations such as setitem/insert and so on
  • store internal dataframes with range indexes (essentially placeholders) and set a valid index on them only for the duration of an operation; that valid index would have to be stored somewhere
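
A minimal sketch of the first option (assuming Modin's modin.config module; the partition count below is just an illustration):

import modin.config as cfg

# Cap the number of partitions before any DataFrame is created, so fewer
# column partitions (and hence fewer copies of the MultiIndex) exist.
# The MODIN_NPARTITIONS environment variable can be used instead of
# calling NPartitions.put().
cfg.NPartitions.put(4)

import modin.pandas as pd
# ... build DataFrames with the MultiIndex as usual ...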

Code to reproduce:

import modin.pandas as pd
import numpy as np
import ray

np.random.seed(42)

count_rows = 10**6
nonunique_ratio = 0.7

base_strings = ["long_string_dataaaaaaaaaaaaaaaa{}", "cat{}"]
arrays = [None, None]
for idx in range(len(arrays)):
    nonunique_count = int(nonunique_ratio * count_rows)
    data = [base_strings[idx].format(x) for x in range(nonunique_count // 2)]
    nonunique_data = np.append(np.array(data), np.array(data))
    unique_data = np.array(
        [base_strings[idx].format(x) for x in range(nonunique_count, count_rows)]
    )
    arrays[idx] = np.append(nonunique_data, unique_data)
    np.random.shuffle(arrays[idx])

index = pd.MultiIndex.from_arrays(arrays)
print(index)

# initialize ray as Modin does
_ = pd.DataFrame([1, 2])
# `ray memory` before all explicit put operations
# --- Aggregate object store stats across all nodes ---
# Plasma memory usage 0 MiB, 4 objects,

# put in storage by part
count_cpus = 100
one_part = count_rows // count_cpus
refs = [None] * count_cpus
for idx in range(count_cpus):
    refs[idx] = ray.put(
        index[idx * one_part : (idx + 1) * one_part]
    )  # it takes ~ 3210 MiB in storage

# `ray memory` output after several ray.put operations
# --- Aggregate object store stats across all nodes ---
# Plasma memory usage 3210 MiB, 104 objects,

# put in storage at once
ref = ray.put(index)  # it takes ~ 40 MiB in storage
# `ray memory` output after last put operation
# --- Aggregate object store stats across all nodes ---
# Plasma memory usage 3250 MiB, 105 objects,
anmyachev added the question ❓, Memory 💾, Ray ⚡ and P1 labels on Nov 21, 2022
@jbrockmendel
Collaborator

store internal dataframes with range indexes

+1. Ideally, in pandas we'll refactor the Manager classes to not hold the axes at all (xref pandas #48126), and subsequently Modin partitions can hold Managers instead of DataFrames.

dchigarev added a commit that referenced this issue Feb 7, 2023
@dchigarev
Collaborator

As a temporary solution, we could try using MultiIndex.remove_unused_levels() when chunking a MultiIndex. It removes unused values from the encoding table, so each partition would store only the part of the table that belongs to it.
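
For context, a small standalone illustration of what remove_unused_levels() does to the encoding table of a sliced MultiIndex (plain pandas, toy data):

import pandas as pd

mi = pd.MultiIndex.from_arrays([["a", "b", "c", "d"], [1, 2, 3, 4]])

chunk = mi[:2]
print(chunk.levels[0])    # Index(['a', 'b', 'c', 'd'], ...) -- the full table is kept
trimmed = chunk.remove_unused_levels()
print(trimmed.levels[0])  # Index(['a', 'b'], ...) -- only the labels this chunk uses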

I've made the following changes to the script above and measured ray memory on the same checkpoints.

for idx in range(count_cpus):
    refs[idx] = ray.put(
-        index[idx * one_part : (idx + 1) * one_part]
+        index[idx * one_part : (idx + 1) * one_part].remove_unused_levels()
    )  # it takes ~ 3210 MiB in storage

checkpoint № | w/o dropping unused levels     | w/ dropping unused levels
------------ | ------------------------------ | -----------------------------
№1           | 0 MiB, 4 objs, 0% full         | 0 MiB, 4 objs, 0% full
№2           | 3210 MiB, 104 objs, 1.68% full | 52 MiB, 104 objs, 0.03% full
№3           | 3250 MiB, 105 objs, 1.7% full  | 92 MiB, 105 objs, 0.05% full

P.S. It also works noticeably faster, since much lighter objects are being put into the plasma store.

I'm wondering what the disadvantages of using .remove_unused_levels() are then. I mean, why isn't this done automatically by pandas? @jbrockmendel, maybe you know?

@jbrockmendel
Collaborator

I'm wondering what the disadvantages of using .remove_unused_levels() are then. I mean, why isn't this done automatically by pandas?

The real reason is most likely that it simply hasn't been suggested. The only downside that comes to mind: if you split a MultiIndex, then remove the unused levels, and then want to compare/concat/setop the results, it is more efficient to have known matching levels.
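
For illustration, the known-matching-levels point with toy data (plain pandas; how much this matters in practice depends on the workload):

import pandas as pd

mi = pd.MultiIndex.from_arrays([["a", "b", "c", "d"], [1, 2, 3, 4]])
left, right = mi[:2], mi[2:]

# Slices share the original encoding table, so their levels match exactly
# and comparisons/concat/set operations don't need to rebuild a common
# table first (per the comment above).
print(left.levels[0].equals(right.levels[0]))      # True

# After trimming, the tables diverge and have to be re-matched by value.
left_t, right_t = left.remove_unused_levels(), right.remove_unused_levels()
print(left_t.levels[0].equals(right_t.levels[0]))  # False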

YarShev pushed a commit that referenced this issue Mar 1, 2023