
read_csv fails with TypeError: object cannot be converted to an IntegerDtype yet succeeds when reading chunks #25472

Closed
teto opened this issue Feb 28, 2019 · 20 comments · Fixed by #43949 or obrasier/cricketstats#8
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. IO CSV read_csv, to_csv

@teto

teto commented Feb 28, 2019

Code Sample, a copy-pastable example if possible

Download this file upload.txt

import pandas as pd

# The file is attached to the GitHub issue
filename = "upload.txt"
# this field is coded on 64 bits, so 'UInt64' looks perfect.
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:

    print("READ CHUNK BY CHUNK")

    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            dtype={column: 'UInt64' },
            usecols=[column],
            chunksize=1
    )
    for chunk in res:
        print(chunk)



    fd.seek(0) # rewind

    print("READ THE WHOLE FILE AT ONCE ")
    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            usecols=[column],
            dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
    )
    print(res)



If I read in chunks, read_csv succeeds; if I try to read the column all at once, I get:

Traceback (most recent call last):
  File "test2.py", line 34, in <module>
    dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
    return cls._from_sequence(scalars, dtype, copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
    values.dtype))
TypeError: object cannot be converted to an IntegerDtype


Expected Output

I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).

Output of pd.show_versions()

I am using v0.23.4 with a patch from master to fix some other bug.

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

Have you been able to narrow down the cause? Possibly start by reading the first n rows, and then bisect from there, to see which line causes the failure?

@teto
Author

teto commented Feb 28, 2019

That's part of the difficulty: depending on the chunk size, the exception is raised or not. With a chunksize of one, it succeeds. Any bigger and the read fails, and I don't get why.

@TomAugspurger
Contributor

TomAugspurger commented Feb 28, 2019 via email

@gfyoung gfyoung added IO CSV read_csv, to_csv Needs Info Clarification about behavior needed to assess issue labels Mar 1, 2019
@gfyoung
Member

gfyoung commented Mar 1, 2019

Also, if you are able to share a file that can reproduce the issue, that would be great.

@teto
Author

teto commented Mar 1, 2019

Sorry, I definitely had uploaded it, but I may have messed up somewhere and it ended up not being visible. In any case, I've put the file in the first post (upload.txt, but it's really a CSV). I think it's a bug because, reading line by line, no value appears to be a problem. The .csv file is generated, so there should be no error in the values either.

@mrimal

mrimal commented Mar 1, 2019

When I tried to use your code to read the file, most of the values in the column showed up as missing, which might be the reason it's not reading as 'UInt64'. Reading it with the default format and/or as string works.

@teto
Author

teto commented Mar 1, 2019

I actually updated to pandas 0.24.1 because it supports empty rows via UInt64 (otherwise, why would it work when reading line by line?). 'UInt64' also works for other columns with empty values; there are just some columns for which it doesn't, and I can't fathom why.

@TomAugspurger
Contributor

Have you had a chance to debug this @teto?

@teto
Author

teto commented Mar 7, 2019

I am not sure what else I can do; I've provided the data file and a standalone example.
If it reads several items, it fails; if it reads just one at a time, it works. Seems like a bug to me, and pandas is too complex for a casual user like me to just dive in and fix it.
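For readers without the attachment, here is a minimal stand-in with the same shape as the failing read (the separator, comment character, and column name are from the issue; the data rows are invented for illustration):

```python
import io
import pandas as pd

# A '|'-separated file where the 64-bit key column is blank on some rows,
# mimicking the attached upload.txt (these particular rows are made up).
data = "frame|tcp.options.mptcp.sendkey\n1|\n2|123456789012345678\n"

df = pd.read_csv(
    io.StringIO(data),
    sep="|",
    comment="#",
    usecols=["tcp.options.mptcp.sendkey"],
    dtype={"tcp.options.mptcp.sendkey": "UInt64"},
)
print(df)
```

On pandas 0.24.x this raises the TypeError above when read in one go; on versions with the fix it produces a nullable UInt64 column with <NA> for the blank row.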

@TomAugspurger
Contributor

Gotcha. Hopefully someone has time to take a look, but you may be the expert here, as this is fairly new.

cc @kprestel who implemented EA support for read_csv.

teto added a commit to teto/pymptcpanalyzer that referenced this issue Mar 11, 2019
because it was comparing values of different types.
For now I encode the failing fields as str instead of UInt64
(dsnraw seems concerned as well)
see pandas-dev/pandas#25472 for more details
@kprestel
Contributor

I'll be able to take a look at this tonight hopefully.

teto added a commit to teto/pymptcpanalyzer that referenced this issue Apr 16, 2019
because it was comparing values of different types.
For now I encode the failing fields as str instead of UInt64
(dsnraw seems concerned as well)
see pandas-dev/pandas#25472 for more details
@NumesSanguis

NumesSanguis commented Apr 16, 2019

Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge.

I'm running into the same problem as the OP when I read one of the sheets of a .xlsx file (pandas 0.24.2).
There are NaN values, but as of pandas 0.24 that should work when doing .astype(pd.Int16Dtype()), right?

This gave the same problem as the OP:

df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())

However, ugly as it is, this seems to have worked for me:

df_sheet.age = df_sheet.age.astype('float')  # first convert to float before int
df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())
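For reference, a self-contained sketch of that two-step workaround on toy data (the column values here are invented, not from the issue):

```python
import pandas as pd

# Object-dtype column with a missing value, like one read from a spreadsheet.
age = pd.Series([21.0, None, 34.0], dtype="object")

# Direct .astype(pd.Int16Dtype()) on object input failed on older pandas;
# converting to float first, then to the nullable dtype, sidesteps it.
age = age.astype("float").astype(pd.Int16Dtype())
print(age.dtype)  # Int16
```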

@mroeschke mroeschke added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Needs Info Clarification about behavior needed to assess issue labels Mar 8, 2020
@lukestanbra

I just ran into this; it looks much more general than a read_csv problem to me.

>>> pd.Series(["1", "2", "3"]).astype(pd.Int64Dtype())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5698, in astype
    new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 582, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 442, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 625, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 821, in astype_nansafe
    return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 354, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 135, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 218, in coerce_to_array
    raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
TypeError: object cannot be converted to an IntegerDtype

I would expect that this should just work? As @NumesSanguis says above, converting via float does work, e.g.

>>> pd.Series(["1", "2", "3"]).astype(float).astype(pd.Int64Dtype())
0    1
1    2
2    3
dtype: Int64

This is using

>>> pd.__version__
'1.0.3'

@TomAugspurger - do you think a new issue needs to be opened for this?

@TomAugspurger
Contributor

TomAugspurger commented May 19, 2020 via email

@lukestanbra

OK, that's good to know. It gets a bit too into the internals for me to follow, but it was interesting to see how you all talk about this kind of stuff. If anyone else stumbles across this, the relevant issues are #33254, #32586 and #33607.

@mroeschke mroeschke added the Bug label May 21, 2020
@dekiesel

@NumesSanguis

Any ideas for a workaround if the integer (18 digits) is too big for float64?
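Not an answer from the thread, but one possible sketch: float64 only represents integers exactly up to 2**53, so for an 18-digit value the round-trip through float would silently round. Parsing to Python ints first keeps the value exact (the lambda and the sample values here are illustrative, not from the issue):

```python
import pandas as pd

# An 18-digit key: beyond 2**53, floats can no longer represent every
# integer, so the astype-via-float trick would quietly corrupt the value.
s = pd.Series(["123456789012345678", None], dtype="object")

# Python ints are arbitrary precision: parse each string exactly,
# leaving missing entries as pd.NA, then cast to the nullable dtype.
exact = s.map(lambda x: int(x) if isinstance(x, str) else pd.NA).astype("UInt64")
print(exact.iloc[0])  # 123456789012345678, unchanged
```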

@NumesSanguis

@dekiesel
Sorry, I don't know.

@alexreg
Contributor

alexreg commented Oct 9, 2021

Still no news about this? It seems like quite a significant bug, and it has been open an extremely long time!

@jreback
Contributor

jreback commented Oct 9, 2021

@alexreg You or anyone else is welcome to submit a PR with a patch, and the core team can review it.

@alexreg
Contributor

alexreg commented Oct 9, 2021

@jreback I'm not sure I'm a good person to analyse the root of this problem, but I'll have a look anyway, and if I can figure it out, will submit a PR.

alexreg added 13 commits to alexreg/pandas that referenced this issue between Oct 9 and Oct 18, 2021
@jreback jreback added this to the 1.4 milestone Oct 18, 2021