BUG: read_table() ignores dtype argument when multi-character separator is specified #6607

mcwitt · 2014-03-11T22:19:14Z

related #4363
closes #3374

Here is a minimal example:

In [21]: df = pd.DataFrame({'A': range(5), 'B': rand(5)})

In [22]: df
Out[22]: 
   A         B
0  0  0.402616
1  1  0.880696
2  2  0.184491
3  3  0.832732
4  4  0.393917

[5 rows x 2 columns]

In [23]: df.to_csv('test.csv', sep=' ')

In [24]: pd.read_csv('test.csv', sep=' ', dtype={'A': np.float64}).dtypes
Out[24]: 
Unnamed: 0      int64
A             float64
B             float64
dtype: object

Here the dtype argument behaves as expected, and column A has type float. However with sep='\s' the dtype argument appears to be ignored:

In [25]: pd.read_csv('test.csv', sep='\s', dtype={'A': np.float64}).dtypes
Out[25]: 
A      int64
B    float64
dtype: object

Version information

In [27]: show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 10.8.0
machine: i386
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1-413-ga71ede3
Cython: 0.20.1
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
lxml: 3.3.1
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.2
pymysql: None
psycopg2: None

The text was updated successfully, but these errors were encountered:

jreback · 2014-03-11T23:53:58Z

this might be a dup of #4363
will mark as a bug for now

would you like to take a stab at fixing it?

mcwitt · 2014-03-12T00:02:22Z

Sure, I would like to take a look at it.

jreback · 2014-03-12T00:08:08Z

https://github.com/pydata/pandas/wiki

lots of tests in io/tests/test_parser.py that u can model like

this might affect python and/or c parsers

lmk if u need help

mcwitt · 2014-03-12T17:09:53Z

OK, thanks for the info. I'll get started looking at the wiki.

On Tue, Mar 11, 2014 at 5:08 PM, jreback notifications@github.com wrote:

https://github.com/pydata/pandas/wiki

lots of tests in io/tests/test_parser.py that u can model like

this might affect python and/or c parsers

lmk if u need help

Reply to this email directly or view it on GitHubhttps://github.com//issues/6607#issuecomment-37361830
.

jseabold · 2014-03-14T17:57:03Z

I guess I'm running into the same thing

[~/statsmodels/statsmodels-skipper/statsmodels/datasets/co2/src/]
[40]: dta = """MLO    580329   4    0         316.10
....: MLO    580405   6    0         317.30
....: MLO    580412   4    0         317.60
....: MLO    580419   6    0         317.50
....: MLO    580426   2    0         316.40
....: """

[~/statsmodels/statsmodels-skipper/statsmodels/datasets/co2/src/]
[41]: dtype = {0 : object, 1 : object, 2 : int, 3 : int, 4 : float}

[~/statsmodels/statsmodels-skipper/statsmodels/datasets/co2/src/]
[42]: pd.read_table(StringIO(dta), sep=" *", header=None, dtype=dtype).dtypes
[42]: 
0     object
1      int64
2      int64
3      int64
4    float64
dtype: object

mcwitt · 2014-03-16T02:49:55Z

It appears that the C parser isn't able to handle regular expressions (yet?) and when a multi-character separator is passed TextFileReader silently falls back to the Python engine. From pandas/io/parsers.py:536:

        if sep is None and not delim_whitespace:
            if engine == 'c':
                engine = 'python'
        elif sep is not None and len(sep) > 1:
            # wait until regex engine integrated
            if engine not in ('python', 'python-fwf'):
                engine = 'python'

But the Python engine doesn't understand dtype and raises an error when it is explicitly chosen:

In [87]: pd.read_csv('test.csv', sep=' ', engine='python', dtype={'A': np.float64})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: The 'dtype' option is not supported with the 'python' engine

In my case, where fields are delimited by whitespace, I can replace sep='\s+' with delim_whitespace=True, which is implemented in the C engine.

I'm thinking it would be polite to warn the user when falling back to the Python parser causes options to be ignored, but maybe this isn't worth the trouble if the C engine will be able to handle regexps in the future.

jreback · 2014-03-17T12:14:19Z

no the PR is to fix this in the python parser. c-parser cannot handle a regex. Not easy to fix, so no time soon on this.

mcwitt · 2014-03-17T20:52:48Z

no the PR is to fix this in the python parser. c-parser cannot handle a
regex. Not easy to fix, so no time soon on this.

OK, agreed.

On Mon, Mar 17, 2014 at 5:14 AM, jreback notifications@github.com wrote:

no the PR is to fix this in the python parser. c-parser cannot handle a
regex. Not easy to fix, so no time soon on this.

Reply to this email directly or view it on GitHubhttps://github.com//issues/6607#issuecomment-37808751
.

mcwitt · 2014-03-21T22:27:10Z

I've made some progress on this. As mentioned above, the C engine is silently failing over to python in the case of a regexp separator (and several other cases). Consequently options specific to the C parser (dtype, skip_footer...) are silently ignored. I see several ways to proceed:

Add warnings when failing over to python, but otherwise leave things the way they are. The converters option is already supported by python and can accomplish almost the same thing as dtype. (could even alias dtype->converters, although this is probably bad because they aren't exactly the same thing)
Raise an error when failing over to python and C-specific options are specified. This breaks several tests in io/tests/test_parsers.py which rely on silent failover to python.
Modify the python parser code to handle dtype correctly. This is beyond what I understand how to do right now given the complications of treating NA values.

(I'm also thinking it might be nice to alias sep='\s+' to delim_whitespace=True to handle this common case in the C parser...)

So far I've made some progress on (2), but am having second thoughts about whether this is the best way to go. Would appreciate any advice.

jreback · 2014-03-21T23:00:11Z

good analysis, I think doing a combination would be worthwhile

translate sep='\s+' to delim_whitespace=True I think is fine, need some tests specifically for this (I think this should always work, but need to validate)
provide a warning if falling back to python from the c parser (e.g. its not specified). something like ParserWarning("falling back from the c-engine to the python-parser because this option <....> was specified and is not supported in the c-engine; you can also specify engine='python' to avoid this warning") (make this a new warning)
you could provide a warning in addition to the above if dtype is specified with the c-engine (e.g. just tack ithe message on that the user can use converters (I don't recall why/how they are different)
can tackle 3) in a future issue, put up check boxes in the top of this issue to track all of these things

great job on finding the issues!

mcwitt · 2014-03-22T23:34:46Z

OK, I've implemented all of this except instead of translating sep='\s+' to delim_whitespace=True I'm playing it safe and producing a warning:

ParserWarning: Note that sep='\s+' is equivalent to delim_whitespace=True, which is supported by the 'c' engine.

(I had added the translation but this was producing errors in several tests in parser_tests.py (e.g. test_read_table_buglet_4x_multiindex) that have sep='\s+' and cause problems with the C engine.)

I have also added test_warn_on_fallback_to_python() to test_parsers.py, but I'm having a (minor?) problem with this: since warnings are only produced on the first execution and several tests before mine trigger ParserWarnings, assert_produces_warning(ParserWarning) in my test fails. I haven't been able to find any places in the testing code where this issue is worked around. Is there a nice way to temporarily forget the warning history in my test code before assert_produces_warning?

jreback · 2014-03-22T23:39:22Z

I think you can do the translation, and just check that the c-engine output is == to the python-engine output (you will have to do this in a separate test class probably). The other test classes uses self.read_csv which cycles thru multiple parsers to test.

I think this would help with your problme about triggering warnings. Essentially have it run only once, and have the tests explicity pass an engine (as you are explicity testing certain engine behavior)

mcwitt · 2014-03-22T23:57:47Z

I think you can do the translation, and just check that the c-engine output is == to the python-engine output (you will have to do this in a separate test class probably).

OK, maybe TestCompareParsers or something similar?

There are several tests in TestParsers (i.e. which are run for all engines) that have C-unsupported options e.g. skip_footer>0 and consequently are just testing the python engine multiple times (meanwhile triggering the new warnings). Can these tests be moved to TestPythonParser?

The test test_read_table_buglet_4x_multiindex() has sep='\s+' but fails with the C parser after translating to delim_whitespace=True, so I think this is exposing a flaw in the C parser. This could also be moved to TestPythonParser for the time being and in the future we could make a new issue of it.

jreback · 2014-03-23T00:11:25Z

go ahead and create TestCompareParsers

if u can take tests that currently just fallback (eg the skip footer one), put a copy of the test that falls back and assert for the parser warning - so when the issue is eventually fixed that test will no longer produce a warning and will then fail (alerting the person changing the code)

you can put a big comment around the test as well.

what I find is that I insert code to fix an issue rather than trying to find if their is an existing test for the issue (which there usually is not ) - so this will provide essentially a nice alert

jreback · 2014-03-23T00:14:49Z

so bottom line is that a test that currently falls back on a failed option should be

moved to python parser
a copy put in c parser but with an assertion that it produces a warning and a big comment that links the 2 tests (as eventually when the issue is fixed they should be merged back into 1 test that is run in multiple engines)

jreback · 2014-04-09T20:03:08Z

@mcwitt how's this coming?

mcwitt · 2014-04-09T21:22:59Z

Sorry for the delay. The warnings and translation of sep='\s+' to
delim_whitespace=True is finished. I'm working on the last few broken
tests, copying tests that only test the Python parser to TestPythonParser
and asserting that they produce a warning when run under ParserTests. So
all that's left I think is to finish the job with the tests and create
TestCompareParsers with a test case to make sure the translation works.

On Wed, Apr 9, 2014 at 1:03 PM, jreback notifications@github.com wrote:

@mcwitt https://github.com/mcwitt how's this coming?

Reply to this email directly or view it on GitHubhttps://github.com//issues/6607#issuecomment-40009848
.

jreback · 2014-04-09T21:27:19Z

great! this would definintly be a welcome addition!

mcwitt · 2014-04-09T23:58:27Z

OK, all of the tests are passing except for several that cause an issue with the C parser:

test_read_table_buglet_4x_multiindex (test_parsers.ParserTests)
test_round_trip_frame (test_clipboard.TestClipboard)
test_round_trip_frame_string (test_clipboard.TestClipboard)

The failures seem to be caused by the C parser adding an extra column when lines begin with spaces. For example

In [3]: data = ' a b c\n 1 2 3\n 4 5 6'

In [4]: pd.read_table(StringIO(data), engine='c', delim_whitespace=True)
Out[4]: 
   Unnamed: 0  a  b  c
0         NaN  1  2  3
1         NaN  4  5  6

[2 rows x 4 columns]

Previously these tests fell back to the python parser (due to sep='\s+' being passed) which correctly ignores leading spaces:

In [5]: pd.read_table(StringIO(data), engine='python', sep='\s+')
Out[5]: 
   a  b  c
0  1  2  3
1  4  5  6

[2 rows x 3 columns]

jreback · 2014-04-10T00:01:31Z

hmm. seem that the c parser is wrong there (though of course their maybe some that works around this wrongness!).

it makes more sense to do what the python parser is doing. are the tests actually different? or is it the c-parser?

mcwitt · 2014-04-10T00:05:39Z

Looks like #3374 is relevant.

are the tests actually different? or is it the c-parser?

Not sure what you mean... The tests expect the behavior of the python parser because previously they fell back to python due to sep='\s+'. Now that we're translating this to delim_whitespace=True the result is the unexpected behavior, hence the tests failing.

jreback · 2014-04-10T00:08:03Z

yep...looks like #3374 should be closed by you as well.

ok... can you fix the c-parser then?

(I meant that sometimes you know that something is broken so you write a test that checks the broken behavior), sort of wrong but you need a test, so it happens.

jreback · 2014-04-10T00:09:15Z

and definitiely need to raise of BOTH delim_whitespace and sep are specified (I would say raise EVEN if sep is '\s+', because its really an error.

mcwitt · 2014-04-10T17:56:30Z

ok... can you fix the c-parser then?

I'll look at this.

and definitiely need to raise of BOTH delim_whitespace and sep are specified (I would say raise EVEN if sep is '\s+', because its really an error.

Agreed. I think a ValueError is appropriate in these cases?

jreback · 2014-04-10T17:57:34Z

yep

mcwitt · 2014-04-16T17:49:56Z

I realize this looks like a mess... unfortunately I don't know how it can be made any cleaner. Copying tests that fall thru to python from ParserTests to TestPythonParser (and asserting that they produce a ParserWarning when run under the C engine) account for the vast majority of the changes, since there is now some duplication.

Maybe it would be best to remove these tests from ParserTests for now, leaving just a comment? I have added a dedicated test that a ParserWarning is produced under the right circumstances.

jreback added Bug labels Mar 11, 2014

jreback added this to the 0.14.0 milestone Mar 11, 2014

mcwitt closed this as completed Mar 12, 2014

mcwitt reopened this Mar 12, 2014

jreback mentioned this issue Mar 22, 2014

ENH: Allow subplots axes to be passed into DataFrame.plot() #6306

Closed

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

mcwitt mentioned this issue Apr 16, 2014

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

Merged

jreback modified the milestones: 0.14.0, 0.15.0 Apr 21, 2014

jreback closed this as completed in #6889 Apr 23, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: read_table() ignores dtype argument when multi-character separator is specified #6607

BUG: read_table() ignores dtype argument when multi-character separator is specified #6607

mcwitt commented Mar 11, 2014

jreback commented Mar 11, 2014

mcwitt commented Mar 12, 2014

jreback commented Mar 12, 2014

mcwitt commented Mar 12, 2014

jseabold commented Mar 14, 2014

mcwitt commented Mar 16, 2014

jreback commented Mar 17, 2014

mcwitt commented Mar 17, 2014

mcwitt commented Mar 21, 2014

jreback commented Mar 21, 2014

mcwitt commented Mar 22, 2014

jreback commented Mar 22, 2014

mcwitt commented Mar 22, 2014

jreback commented Mar 23, 2014

jreback commented Mar 23, 2014

jreback commented Apr 9, 2014

mcwitt commented Apr 9, 2014

jreback commented Apr 9, 2014

mcwitt commented Apr 9, 2014

jreback commented Apr 10, 2014

mcwitt commented Apr 10, 2014

jreback commented Apr 10, 2014

jreback commented Apr 10, 2014

mcwitt commented Apr 10, 2014

jreback commented Apr 10, 2014

mcwitt commented Apr 16, 2014

BUG: read_table() ignores dtype argument when multi-character separator is specified #6607

BUG: read_table() ignores dtype argument when multi-character separator is specified #6607

Comments

mcwitt commented Mar 11, 2014

Version information

jreback commented Mar 11, 2014

mcwitt commented Mar 12, 2014

jreback commented Mar 12, 2014

mcwitt commented Mar 12, 2014

jseabold commented Mar 14, 2014

mcwitt commented Mar 16, 2014

jreback commented Mar 17, 2014

mcwitt commented Mar 17, 2014

mcwitt commented Mar 21, 2014

jreback commented Mar 21, 2014

mcwitt commented Mar 22, 2014

jreback commented Mar 22, 2014

mcwitt commented Mar 22, 2014

jreback commented Mar 23, 2014

jreback commented Mar 23, 2014

jreback commented Apr 9, 2014

mcwitt commented Apr 9, 2014

jreback commented Apr 9, 2014

mcwitt commented Apr 9, 2014

jreback commented Apr 10, 2014

mcwitt commented Apr 10, 2014

jreback commented Apr 10, 2014

jreback commented Apr 10, 2014

mcwitt commented Apr 10, 2014

jreback commented Apr 10, 2014

mcwitt commented Apr 16, 2014