
Performance drop and MemoryError during insert and _consolidate_inplace #26985

Closed
ololobus opened this issue Jun 21, 2019 · 7 comments · Fixed by #38380
Labels
Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance
Milestone: 1.3

Comments

@ololobus

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

n = 2000000
data = pd.DataFrame({'a': range(n)})

# Seed the frame with 99 random object-dtype columns.
for i in range(1, 100):
    data['col_' + str(i)] = np.random.choice(['a', 'b'], n)

# Then append 599 scalar columns, one assignment at a time.
for i in range(1, 600):
    data['test_{}'.format(i)] = i
    print(str(i))

Problem description

This follows a StackOverflow question.

I ran this code sample on an Ubuntu 18.04 LTS machine with 16 GB of RAM and 2 GB of swap. Execution produces the following stack trace:

294
295
296
297
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1053, in set
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "py-memory-test.py", line 12, in <module>
    data['test_{}'.format(i)] = i
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3370, in __setitem__
    self._set_item(key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3446, in _set_item
    NDFrame._set_item(self, key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3172, in _set_item
    self._data.set(key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1056, in set
    self.insert(len(self.items), item, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1184, in insert
    self._consolidate_inplace()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1899, in _consolidate
    _can_consolidate=_can_consolidate)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 3149, in _merge_blocks
    new_values = new_values[argsort]
MemoryError

I found the following code inside pandas core:

def insert(self, loc, item, value, allow_duplicates=False):
    ...
    self._known_consolidated = False

    if len(self.blocks) > 100:
        self._consolidate_inplace()

It seems that this consolidation process runs roughly every 100th iteration and substantially affects performance and memory usage. To test this hypothesis, I changed 100 to 1000000 and the script ran just fine: no performance gaps, no MemoryError.
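To illustrate why that threshold matters, here is a small sketch of my own, scaled down from the reproducer: each scalar column assignment lands in its own internal block, so the block count grows with every insert until consolidation merges same-dtype blocks. Note that `_mgr` is private pandas internals (it was called `_data` in the 0.24 traceback above) and may change between versions; it is used here only to count blocks.

```python
import pandas as pd

# Scaled-down sketch: each new scalar column becomes its own block
# until the `len(self.blocks) > 100` check triggers consolidation.
df = pd.DataFrame({"a": range(10)})
for i in range(1, 20):
    df[f"test_{i}"] = i

# NOTE: `_mgr` is private pandas API (`_data` in pandas 0.24).
# Typically the original block plus one per insert, so about 20 here.
print(len(df._mgr.blocks))
```

With only 20 blocks the consolidation branch never fires; at full scale the count repeatedly climbs past 100, and each consolidation re-copies the whole ~1.5 GB of merged data.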

This looks quite strange to me, since 'consolidation' sounds like something that should reduce memory usage. Perhaps pandas could allocate private swap files (e.g. via mmap) when it is running out of RAM plus system swap, so that the consolidation process can complete successfully.
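As a practical workaround (a common pattern, not something from this thread): building all the new columns at once sidesteps the repeated insert/consolidate cycle, because the block manager only has to attach one extra block instead of 599. A scaled-down sketch:

```python
import pandas as pd

# Scaled down from the 2_000_000-row reproducer.
n = 1_000
data = pd.DataFrame({"a": range(n)})

# Build all 599 constant columns in one DataFrame, then attach them
# with a single concat instead of 599 separate inserts.
new_cols = pd.DataFrame({f"test_{i}": [i] * n for i in range(1, 600)})
data = pd.concat([data, new_cols], axis=1)

print(data.shape)  # (1000, 600)
```

The same idea applies to the reporter's loop: collect the columns in a dict and concat once at the end.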

Expected Output

1
2
3
4
5
...
599

That is, all 599 iterations complete without substantial freezes every ~100th iteration and without a MemoryError.
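The periodic freezes can be measured directly. This scaled-down sketch (my own, not from the thread) times each column insert so the spikes stand out; on newer pandas the same insert pattern also emits a fragmentation PerformanceWarning, which is silenced here to keep the output readable.

```python
import time
import warnings
import pandas as pd

# Newer pandas warns about fragmentation for exactly this pattern.
warnings.simplefilter("ignore")

n = 10_000  # scaled down from the original 2_000_000 rows
data = pd.DataFrame({"a": range(n)})

timings = []
for i in range(1, 300):
    start = time.perf_counter()
    data[f"test_{i}"] = i
    timings.append((time.perf_counter() - start, i))

# At full scale, the slowest inserts line up with the consolidation
# triggered once the block count passes the ~100 threshold.
for elapsed, i in sorted(timings, reverse=True)[:3]:
    print(f"iteration {i}: {elapsed * 1000:.2f} ms")
```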

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

I'm not sure there's any concrete action we can take on this right now. Rewriting the block manager is a long term TODO (search for block consolidation), but it'll likely be a while.

@ololobus
Author

I could not find anything similar to this issue via GitHub search with the 'consolidation' keyword, nor anything relevant to the 'block manager'. There are a few points in the old pandas 1.0 roadmap, but they are rather vague.

Anyway, do you have any notes on the desired block manager architecture? Or is that still to be figured out?

@TomAugspurger
Contributor

TomAugspurger commented Jun 24, 2019 via email

@metariat

What's the impact if I modify the line if len(self.blocks) > 100: to if len(self.blocks) > 10000: ?

@TomAugspurger
Contributor

TomAugspurger commented Jun 29, 2019 via email

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Jul 21, 2019
@luohaoasu

Got almost the same error: MemoryError.

@dmitra79

I have a similar issue. I am looping over a list of files of the same size and performing some operations on each. This error happens roughly every 9th file. Almost all variables are local to the loop, so getting this error is very strange.

Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-45-generic x86_64)
pandas 0.24.2 py37he6710b0_0

@jbrockmendel jbrockmendel added the Internals Related to non-user accessible pandas implementation label Sep 22, 2020
@jreback jreback added this to the 1.3 milestone Dec 11, 2020
7 participants