
Performance drop and MemoryError during insert and _consolidate_inplace #26985

Closed
ololobus opened this issue Jun 21, 2019 · 7 comments · Fixed by #38380
Labels
Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance
Milestone: 1.3

Comments

@ololobus

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

n = 2000000
data = pd.DataFrame({'a': range(n)})

# Seed the frame with 99 random object-dtype columns.
for i in range(1, 100):
    data['col_' + str(i)] = np.random.choice(['a', 'b'], n)

# Then append 599 scalar columns, one assignment at a time.
for i in range(1, 600):
    data['test_{}'.format(i)] = i
    print(str(i))

Problem description

This follows a StackOverflow question.

I ran this code sample on an Ubuntu 18.04 LTS machine with 16 GB of RAM and 2 GB of swap. Execution produces the following stack trace:

294
295
296
297
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1053, in set
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "py-memory-test.py", line 12, in <module>
    data['test_{}'.format(i)] = i
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3370, in __setitem__
    self._set_item(key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3446, in _set_item
    NDFrame._set_item(self, key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3172, in _set_item
    self._data.set(key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1056, in set
    self.insert(len(self.items), item, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1184, in insert
    self._consolidate_inplace()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1899, in _consolidate
    _can_consolidate=_can_consolidate)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 3149, in _merge_blocks
    new_values = new_values[argsort]
MemoryError

I found the following code inside pandas core:

def insert(self, loc, item, value, allow_duplicates=False):
    ...
    self._known_consolidated = False

    if len(self.blocks) > 100:
        self._consolidate_inplace()

It seems that this consolidation process runs roughly every 100th iteration and substantially affects performance and memory usage. To test this hypothesis, I changed 100 to 1000000 and the script ran just fine: no performance gaps, no MemoryError.
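To illustrate why that threshold matters, here is a small sketch of my own, scaled down from the reproducer: each scalar column assignment lands in its own internal block, so the block count grows with every insert until consolidation merges same-dtype blocks. Note that `_mgr` is private pandas internals (it was called `_data` in the 0.24 traceback above) and may change between versions; it is used here only to count blocks.

```python
import pandas as pd

# Scaled-down sketch: each new scalar column becomes its own block
# until the `len(self.blocks) > 100` check triggers consolidation.
df = pd.DataFrame({"a": range(10)})
for i in range(1, 20):
    df[f"test_{i}"] = i

# NOTE: `_mgr` is private pandas API (`_data` in pandas 0.24).
# Typically the original block plus one per insert, so about 20 here.
print(len(df._mgr.blocks))
```

With only 20 blocks the consolidation branch never fires; at full scale the count repeatedly climbs past 100, and each consolidation re-copies the whole ~1.5 GB of merged data.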

This looks quite strange to me, since 'consolidation' sounds like something that should reduce memory usage. Perhaps pandas could allocate private swap files (e.g. via mmap) when it is running out of RAM plus system swap, so that the consolidation process can complete successfully.
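As a practical workaround (a common pattern, not something from this thread): building all the new columns at once sidesteps the repeated insert/consolidate cycle, because the block manager only has to attach one extra block instead of 599. A scaled-down sketch:

```python
import pandas as pd

# Scaled down from the 2_000_000-row reproducer.
n = 1_000
data = pd.DataFrame({"a": range(n)})

# Build all 599 constant columns in one DataFrame, then attach them
# with a single concat instead of 599 separate inserts.
new_cols = pd.DataFrame({f"test_{i}": [i] * n for i in range(1, 600)})
data = pd.concat([data, new_cols], axis=1)

print(data.shape)  # (1000, 600)
```

The same idea applies to the reporter's loop: collect the columns in a dict and concat once at the end.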

Expected Output

1
2
3
4
5
...
599

That is, all 599 iterations complete without substantial freezes every ~100th iteration and without a MemoryError.
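The periodic freezes can be measured directly. This scaled-down sketch (my own, not from the thread) times each column insert so the spikes stand out; on newer pandas the same insert pattern also emits a fragmentation PerformanceWarning, which is silenced here to keep the output readable.

```python
import time
import warnings
import pandas as pd

# Newer pandas warns about fragmentation for exactly this pattern.
warnings.simplefilter("ignore")

n = 10_000  # scaled down from the original 2_000_000 rows
data = pd.DataFrame({"a": range(n)})

timings = []
for i in range(1, 300):
    start = time.perf_counter()
    data[f"test_{i}"] = i
    timings.append((time.perf_counter() - start, i))

# At full scale, the slowest inserts line up with the consolidation
# triggered once the block count passes the ~100 threshold.
for elapsed, i in sorted(timings, reverse=True)[:3]:
    print(f"iteration {i}: {elapsed * 1000:.2f} ms")
```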

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

I'm not sure there's any concrete action we can take on this right now. Rewriting the block manager is a long term TODO (search for block consolidation), but it'll likely be a while.

@ololobus
Author

I could not find anything similar to this issue via GitHub search with the 'consolidation' keyword, nor anything relevant to the 'block manager'. There are a few points in the old pandas 1.0 roadmap, but they are rather vague.

Anyway, do you have any notes on the desired block manager architecture? Or is that still to be figured out?

@TomAugspurger
Contributor

TomAugspurger commented Jun 24, 2019 via email

@metariat

What's the impact if I modify the line if len(self.blocks) > 100: to if len(self.blocks) > 10000: ?

@TomAugspurger
Contributor

TomAugspurger commented Jun 29, 2019 via email

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Jul 21, 2019
@luohaoasu

Got almost the same error: MemoryError.

@dmitra79

I have a similar issue. I am looping over a list of files of the same size and performing some operations on each. This error happens roughly every 9th file. Almost all variables are local to the loop, so getting this error is very strange.

Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-45-generic x86_64)
pandas 0.24.2 py37he6710b0_0

@jbrockmendel jbrockmendel added the Internals Related to non-user accessible pandas implementation label Sep 22, 2020
@jreback jreback added this to the 1.3 milestone Dec 11, 2020
7 participants