
read_csv fails with TypeError: object cannot be converted to an IntegerDtype yet succeeds when reading chunks #25472

Closed
teto opened this issue Feb 28, 2019 · 20 comments · Fixed by #43949 or obrasier/cricketstats#8
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. IO CSV read_csv, to_csv

@teto

teto commented Feb 28, 2019

Code Sample, a copy-pastable example if possible

Download this file upload.txt

import pandas as pd

# The file is attached to the GitHub issue
filename = "upload.txt"
# this field is coded on 64 bits, so 'UInt64' looks perfect.
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:

    print("READ CHUNK BY CHUNK")

    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            dtype={column: 'UInt64' },
            usecols=[column],
            chunksize=1
    )
    for chunk in res:
        print(chunk)



    fd.seek(0) # rewind

    print("READ THE WHOLE FILE AT ONCE ")
    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            usecols=[column],
            dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
    )
    print(res)



If I read in chunks, read_csv succeeds; if I try to read the column all at once, I get:

Traceback (most recent call last):
  File "test2.py", line 34, in <module>
    dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
    return cls._from_sequence(scalars, dtype, copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
    values.dtype))
TypeError: object cannot be converted to an IntegerDtype


Expected Output

I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).

Output of pd.show_versions()

I am using v0.23.4 with a patch from master to fix some other bug.

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

Have you been able to narrow down the cause? Possibly start by reading the first n rows, and then bisect from there, to see which line causes the failure?

@teto
Author

teto commented Feb 28, 2019

That's part of the difficulty: depending on the chunk size, the exception is raised or not. With a chunksize of one, it succeeds. Any bigger and the read fails, and I don't get why.

@TomAugspurger
Contributor

TomAugspurger commented Feb 28, 2019 via email

@gfyoung gfyoung added IO CSV read_csv, to_csv Needs Info Clarification about behavior needed to assess issue labels Mar 1, 2019
@gfyoung
Member

gfyoung commented Mar 1, 2019

Also, if you are able to share a file that can reproduce the issue, that would be great.

@teto
Author

teto commented Mar 1, 2019

Sorry, I definitely had uploaded it, but I may have messed up somewhere and it ended up not being visible. In any case, I've put the file in the first post (upload.txt, but it's really a CSV). I think it's a bug because, reading line by line, no value appears to be a problem. The .csv file is generated, so there should be no error in the values either.

@mrimal

mrimal commented Mar 1, 2019

When I tried to use your code to read the file, most of the values in the column showed up as missing, which might be the reason it's not reading as 'UInt64'. Reading it with the default format and/or as string works.

@teto
Author

teto commented Mar 1, 2019

I actually updated to pandas 0.24.1 because it supports empty rows via UInt64 (otherwise, why would it work when reading line by line?). 'UInt64' also works for other columns with empty values; there are just some columns for which it doesn't, and I can't fathom why.

@TomAugspurger
Contributor

Have you had a chance to debug this @teto?

@teto
Author

teto commented Mar 7, 2019

I am not sure what else I can do; I've provided the data file and a standalone example.
If it reads several items, it fails; if it reads just one at a time, it works. Seems like a bug to me, and pandas is too complex for a casual user like me to just dive in and fix it.
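For readers without the attachment, here is a minimal stand-in with the same shape as the failing read (the separator, comment character, and column name are from the issue; the data rows are invented for illustration):

```python
import io
import pandas as pd

# A '|'-separated file where the 64-bit key column is blank on some rows,
# mimicking the attached upload.txt (these particular rows are made up).
data = "frame|tcp.options.mptcp.sendkey\n1|\n2|123456789012345678\n"

df = pd.read_csv(
    io.StringIO(data),
    sep="|",
    comment="#",
    usecols=["tcp.options.mptcp.sendkey"],
    dtype={"tcp.options.mptcp.sendkey": "UInt64"},
)
print(df)
```

On pandas 0.24.x this raises the TypeError above when read in one go; on versions with the fix it produces a nullable UInt64 column with <NA> for the blank row.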

@TomAugspurger
Contributor

Gotcha. Hopefully someone has time to take a look, but you may be the expert here, as this is fairly new.

cc @kprestel who implemented EA support for read_csv.

teto added a commit to teto/pymptcpanalyzer that referenced this issue Mar 11, 2019
because it was comparing values of different types.
For now I encode the failing fields as str instead of UInt64
(dsnraw seems concerned as well)
see pandas-dev/pandas#25472 for more details
@kprestel
Contributor

I'll be able to take a look at this tonight hopefully.

teto added a commit to teto/pymptcpanalyzer that referenced this issue Apr 16, 2019
because it was comparing values of different types.
For now I encode the failing fields as str instead of UInt64
(dsnraw seems concerned as well)
see pandas-dev/pandas#25472 for more details
@NumesSanguis

NumesSanguis commented Apr 16, 2019

Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge.

I'm running into the same problem as the OP when I read one of the sheets of a .xlsx file (pandas 0.24.2).
There are NaN values, but as of pandas 0.24 that should work when doing .astype(pd.Int16Dtype()), right?

This gave the same problem as the OP:

df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())

However, ugly as it is, this seems to have worked for me:

df_sheet.age = df_sheet.age.astype('float')  # first convert to float before int
df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())
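For reference, a self-contained sketch of that two-step workaround on toy data (the column values here are invented, not from the issue):

```python
import pandas as pd

# Object-dtype column with a missing value, like one read from a spreadsheet.
age = pd.Series([21.0, None, 34.0], dtype="object")

# Direct .astype(pd.Int16Dtype()) on object input failed on older pandas;
# converting to float first, then to the nullable dtype, sidesteps it.
age = age.astype("float").astype(pd.Int16Dtype())
print(age.dtype)  # Int16
```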

@mroeschke mroeschke added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Needs Info Clarification about behavior needed to assess issue labels Mar 8, 2020
@lukestanbra

I just ran into this; it looks much more general than a read_csv problem to me.

>>> pd.Series(["1", "2", "3"]).astype(pd.Int64Dtype())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5698, in astype
    new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 582, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 442, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 625, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 821, in astype_nansafe
    return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 354, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 135, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 218, in coerce_to_array
    raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
TypeError: object cannot be converted to an IntegerDtype

I would expect that this should just work? As @NumesSanguis says above, converting via float does work, e.g.

>>> pd.Series(["1", "2", "3"]).astype(float).astype(pd.Int64Dtype())
0    1
1    2
2    3
dtype: Int64

This is using

>>> pd.__version__
'1.0.3'

@TomAugspurger - do you think a new issue needs to be opened for this?

@TomAugspurger
Contributor

TomAugspurger commented May 19, 2020 via email

@lukestanbra

OK, that's good to know. It gets a bit too into the internals for me to follow, but it was interesting to see how you all talk about this kind of stuff. If anyone else stumbles across this, the relevant issues are #33254, #32586 and #33607.

@mroeschke mroeschke added the Bug label May 21, 2020
@dekiesel

@NumesSanguis

Any ideas for a workaround if the integer (18 digits) is too big for float64?
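Not an answer from the thread, but one possible sketch: float64 only represents integers exactly up to 2**53, so for an 18-digit value the round-trip through float would silently round. Parsing to Python ints first keeps the value exact (the lambda and the sample values here are illustrative, not from the issue):

```python
import pandas as pd

# An 18-digit key: beyond 2**53, floats can no longer represent every
# integer, so the astype-via-float trick would quietly corrupt the value.
s = pd.Series(["123456789012345678", None], dtype="object")

# Python ints are arbitrary precision: parse each string exactly,
# leaving missing entries as pd.NA, then cast to the nullable dtype.
exact = s.map(lambda x: int(x) if isinstance(x, str) else pd.NA).astype("UInt64")
print(exact.iloc[0])  # 123456789012345678, unchanged
```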

@NumesSanguis

@dekiesel
Sorry, I don't know.

@alexreg
Contributor

alexreg commented Oct 9, 2021

Still no news about this? It seems like quite a significant bug, and it has been open an extremely long time!

@jreback
Contributor

jreback commented Oct 9, 2021

@alexreg You or anyone else is welcome to submit a PR with a patch, and the core team can review it.

@alexreg
Contributor

alexreg commented Oct 9, 2021

@jreback I'm not sure I'm a good person to analyse the root of this problem, but I'll have a look anyway, and if I can figure it out, will submit a PR.

alexreg added 13 commits to alexreg/pandas that referenced this issue between Oct 9 and Oct 18, 2021
@jreback jreback added this to the 1.4 milestone Oct 18, 2021