BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5 #35499
Comments
Thanks @ldacey for the report. I can confirm this was working in 1.0.5. This change in behaviour is due to #33465 cc @topper-123
The problem is the array conversion. Your example can be boiled down further to:

>>> data = ["a" * 131880] * 201368
>>> data = np.array(data, dtype=object)
>>> np.asarray(data, str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880

The problem is we go from a small list (memory-wise) to a large array, because the conversion materializes a fixed-width <U131880 array. The call to np.asarray(data, str) is needed to ensure non-string elements are converted to str, e.g.:

>>> data = [0] + ["a" * 131880] * 201368  # data[0] is not a string
>>> data = np.array(data, dtype=object)
>>> data = np.asarray(data, str)  # used to ensure data[0] converts to str
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201369,) and data type <U131880

Is there a way to avoid this/make it more efficient? I've got a feeling that this is pushing against the limits of what numpy can do with strings.

EDIT: added an example of why asarray(data, str) is used.
@ldacey, an alternative for your use case could be to use StringArray directly, which bypasses the conversion, i.e.:
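A minimal sketch of that route, assuming every element is already a Python string or pd.NA; pd.arrays.StringArray wraps an object ndarray without building the fixed-width <U copy:

>>> import numpy as np
>>> import pandas as pd
>>> data = np.array(["a" * 131880] * 201368, dtype=object)
>>> arr = pd.arrays.StringArray(data)  # validates the elements, no <U allocation
>>> s = pd.Series(arr)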
Just as a note: I would not expect converting a list of many references to the same object to a StringArray to necessarily preserve that fact once it's converted to StringArray's internal storage. If/when we use Arrow for storing string memory, that wouldn't be possible.
That is probably not what the OP is doing; I used that as a reproducible code sample. I think the issue is that the memory needed is proportional to the number of rows times the length of the longest string (all rows could be different).
As far as what I am doing: my current code has a pyarrow_schema that I defined. For any column I declared as pa.string(), my script converts the data with .astype("string"). I tried to be explicit with these types because I faced issues with inconsistent schemas when reading data from the parquet files downstream (related to null data for the most part).
For now, I switched back to .astype(str), since that runs on the current version of pandas for this data.
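A rough sketch of the kind of schema-driven conversion described above (the schema, column names, and helper function are illustrative, not the actual code):

>>> import pyarrow as pa
>>>
>>> pyarrow_schema = pa.schema([("comment_html", pa.string()), ("ticket_id", pa.int64())])
>>>
>>> def apply_string_dtypes(df, schema):
...     # convert every column declared as pa.string() to the pandas "string" dtype
...     for field in schema:
...         if pa.types.is_string(field.type):
...             df[field.name] = df[field.name].astype("string")
...     return df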
maybe a more representative MRE
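A sketch of such an MRE, assuming it mirrors the shape and sparsity reported in the issue (mostly missing values plus one very long string):

>>> import numpy as np
>>> import pandas as pd
>>>
>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>> data[0] = "a" * 131880
>>> pd.Series(data).astype("string")
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880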
Yeah I agree, but the current (simplified) conversion chain ...
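A back-of-the-envelope sketch of why the intermediate fixed-width <U array in that chain is so large (the figures come straight from the error message):

>>> rows = 201368          # rows in the column
>>> longest = 131880       # length of the longest string
>>> bytes_per_char = 4     # numpy <U dtype stores UTF-32, 4 bytes per character
>>> round(rows * longest * bytes_per_char / 2**30, 1)
98.9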
@simonjayhawkins, both your examples fail if dtype=str is used as well:

>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data, dtype=str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880

In some sense, the fact that dtype="string" worked in v1.0 was "just luck", and we should fix this problem in both the str and "string" cases.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
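A minimal sketch of the failing conversion (the file and column names are hypothetical; the row count and sparsity match the figures in the problem description below):

>>> import pandas as pd
>>>
>>> df = pd.read_parquet("tickets.parquet")  # 201368 rows; the HTML column is ~94% null
>>> df["comment_html"] = df["comment_html"].astype(str)       # works
>>> df["comment_html"] = df["comment_html"].astype("string")  # MemoryError on pandas 1.1.0; worked on 1.0.5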
I tried to pinpoint the specific row which causes the error. The column has HTML data like this:
'''<div class="comment" dir="auto"><p dir="auto">Request <a href="/agent/tickets/test" target="_blank" rel="ticket">#test</a> "RE: ** M ..." Last comment in request </a>:</p> <p dir="auto"></p> <p dir="auto">Thank you</p> <p dir="auto">Stuff.</p> <p dir="auto">We will keep you posted .</p> <p dir="auto">Regards,</p> <p dir="auto">Name</p></div>'''
Problem description
I have code which has been converting columns to the "string" dtype, and this worked up until pandas 1.1.0.
For example, I tried to process a file which I successfully processed in April: it works when I use .astype(str), but it fails when I use .astype("string"), even though this worked in pandas 1.0.5.
The column does not need to be the new "string" type, but I wanted to raise this issue anyway.
Rows: 201368
Empty/null rows for the column in question: 189014 / 201368
So this column is quite sparse, and as I mentioned below, if I filter the nulls and then do .astype("string") it runs fine. I am not sure why this worked before (same server, 64 GB of RAM); this file was previously processed as a "string" before the update.
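A sketch of the filter-then-convert workaround mentioned above, assuming a DataFrame df with a hypothetical column name comment_html:

>>> non_null = df.loc[df["comment_html"].notna(), "comment_html"]
>>> converted = non_null.astype("string")  # far fewer rows now share the cost of the longest string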
Error:
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0