Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: read_table() ignores dtype argument when multi-character separator is specified #6607

Closed
mcwitt opened this issue Mar 11, 2014 · 26 comments · Fixed by #6889
Closed

BUG: read_table() ignores dtype argument when multi-character separator is specified #6607

mcwitt opened this issue Mar 11, 2014 · 26 comments · Fixed by #6889
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@mcwitt
Copy link
Contributor

mcwitt commented Mar 11, 2014

related #4363
closes #3374

Here is a minimal example:

In [21]: df = pd.DataFrame({'A': range(5), 'B': rand(5)})

In [22]: df
Out[22]: 
   A         B
0  0  0.402616
1  1  0.880696
2  2  0.184491
3  3  0.832732
4  4  0.393917

[5 rows x 2 columns]

In [23]: df.to_csv('test.csv', sep=' ')

In [24]: pd.read_csv('test.csv', sep=' ', dtype={'A': np.float64}).dtypes
Out[24]: 
Unnamed: 0      int64
A             float64
B             float64
dtype: object

Here the dtype argument behaves as expected, and column A has type float. However with sep='\s' the dtype argument appears to be ignored:

In [25]: pd.read_csv('test.csv', sep='\s', dtype={'A': np.float64}).dtypes
Out[25]: 
A      int64
B    float64
dtype: object

Version information

In [27]: show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 10.8.0
machine: i386
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1-413-ga71ede3
Cython: 0.20.1
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
lxml: 3.3.1
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.2
pymysql: None
psycopg2: None
@jreback jreback added this to the 0.14.0 milestone Mar 11, 2014
@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

this might be a dup of #4363
will mark as a bug for now

would you like to take a stab at fixing it?

@mcwitt
Copy link
Contributor Author

mcwitt commented Mar 12, 2014

Sure, I would like to take a look at it.

@mcwitt mcwitt closed this as completed Mar 12, 2014
@mcwitt mcwitt reopened this Mar 12, 2014
@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

https://github.com/pydata/pandas/wiki

lots of tests in io/tests/test_parser.py that u can model like

this might affect python and/or c parsers

lmk if u need help

@mcwitt
Copy link
Contributor Author

mcwitt commented Mar 12, 2014

OK, thanks for the info. I'll get started looking at the wiki.

On Tue, Mar 11, 2014 at 5:08 PM, jreback notifications@github.com wrote:

https://github.com/pydata/pandas/wiki

lots of tests in io/tests/test_parser.py that u can model like

this might affect python and/or c parsers

lmk if u need help

Reply to this email directly or view it on GitHubhttps://github.com//issues/6607#issuecomment-37361830
.

@jseabold
Copy link
Contributor

I guess I'm running into the same thing

[~/statsmodels/statsmodels-skipper/statsmodels/datasets/co2/src/]
[40]: dta = """MLO    580329   4    0         316.10
....: MLO    580405   6    0         317.30
....: MLO    580412   4    0         317.60
....: MLO    580419   6    0         317.50
....: MLO    580426   2    0         316.40
....: """

[~/statsmodels/statsmodels-skipper/statsmodels/datasets/co2/src/]
[41]: dtype = {0 : object, 1 : object, 2 : int, 3 : int, 4 : float}

[~/statsmodels/statsmodels-skipper/statsmodels/datasets/co2/src/]
[42]: pd.read_table(StringIO(dta), sep=" *", header=None, dtype=dtype).dtypes
[42]: 
0     object
1      int64
2      int64
3      int64
4    float64
dtype: object

@mcwitt
Copy link
Contributor Author

mcwitt commented Mar 16, 2014

It appears that the C parser isn't able to handle regular expressions (yet?) and when a multi-character separator is passed TextFileReader silently falls back to the Python engine. From pandas/io/parsers.py:536:

        if sep is None and not delim_whitespace:
            if engine == 'c':
                engine = 'python'
        elif sep is not None and len(sep) > 1:
            # wait until regex engine integrated
            if engine not in ('python', 'python-fwf'):
                engine = 'python'

But the Python engine doesn't understand dtype and raises an error when it is explicitly chosen:

In [87]: pd.read_csv('test.csv', sep=' ', engine='python', dtype={'A': np.float64})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: The 'dtype' option is not supported with the 'python' engine

In my case, where fields are delimited by whitespace, I can replace sep='\s+' with delim_whitespace=True, which is implemented in the C engine.

I'm thinking it would be polite to warn the user when falling back to the Python parser causes options to be ignored, but maybe this isn't worth the trouble if the C engine will be able to handle regexps in the future.

@jreback
Copy link
Contributor

jreback commented Mar 17, 2014

no the PR is to fix this in the python parser. c-parser cannot handle a regex. Not easy to fix, so no time soon on this.

@mcwitt
Copy link
Contributor Author

mcwitt commented Mar 17, 2014

no the PR is to fix this in the python parser. c-parser cannot handle a
regex. Not easy to fix, so no time soon on this.

OK, agreed.

On Mon, Mar 17, 2014 at 5:14 AM, jreback notifications@github.com wrote:

no the PR is to fix this in the python parser. c-parser cannot handle a
regex. Not easy to fix, so no time soon on this.

Reply to this email directly or view it on GitHubhttps://github.com//issues/6607#issuecomment-37808751
.

@mcwitt
Copy link
Contributor Author

mcwitt commented Mar 21, 2014

I've made some progress on this. As mentioned above, the C engine is silently failing over to python in the case of a regexp separator (and several other cases). Consequently options specific to the C parser (dtype, skip_footer...) are silently ignored. I see several ways to proceed:

  1. Add warnings when failing over to python, but otherwise leave things the way they are. The converters option is already supported by python and can accomplish almost the same thing as dtype. (could even alias dtype->converters, although this is probably bad because they aren't exactly the same thing)
  2. Raise an error when failing over to python and C-specific options are specified. This breaks several tests in io/tests/test_parsers.py which rely on silent failover to python.
  3. Modify the python parser code to handle dtype correctly. This is beyond what I understand how to do right now given the complications of treating NA values.

(I'm also thinking it might be nice to alias sep='\s+' to delim_whitespace=True to handle this common case in the C parser...)

So far I've made some progress on (2), but am having second thoughts about whether this is the best way to go. Would appreciate any advice.

@jreback
Copy link
Contributor

jreback commented Mar 21, 2014

good analysis, I think doing a combination would be worthwhile

  • translate sep='\s+' to delim_whitespace=True I think is fine, need some tests specifically for this (I think this should always work, but need to validate)
  • provide a warning if falling back to python from the c parser (e.g. its not specified). something like ParserWarning("falling back from the c-engine to the python-parser because this option <....> was specified and is not supported in the c-engine; you can also specify engine='python' to avoid this warning") (make this a new warning)
  • you could provide a warning in addition to the above if dtype is specified with the c-engine (e.g. just tack ithe message on that the user can use converters (I don't recall why/how they are different)
  • can tackle 3) in a future issue, put up check boxes in the top of this issue to track all of these things

great job on finding the issues!

@mcwitt
Copy link
Contributor Author

mcwitt commented Mar 22, 2014

OK, I've implemented all of this except instead of translating sep='\s+' to delim_whitespace=True I'm playing it safe and producing a warning:

ParserWarning: Note that sep='\s+' is equivalent to delim_whitespace=True, which is supported by the 'c' engine.

(I had added the translation but this was producing errors in several tests in parser_tests.py (e.g. test_read_table_buglet_4x_multiindex) that have sep='\s+' and cause problems with the C engine.)

I have also added test_warn_on_fallback_to_python() to test_parsers.py, but I'm having a (minor?) problem with this: since warnings are only produced on the first execution and several tests before mine trigger ParserWarnings, assert_produces_warning(ParserWarning) in my test fails. I haven't been able to find any places in the testing code where this issue is worked around. Is there a nice way to temporarily forget the warning history in my test code before assert_produces_warning?

@jreback
Copy link
Contributor

jreback commented Mar 22, 2014

I think you can do the translation, and just check that the c-engine output is == to the python-engine output (you will have to do this in a separate test class probably). The other test classes uses self.read_csv which cycles thru multiple parsers to test.

I think this would help with your problme about triggering warnings. Essentially have it run only once, and have the tests explicity pass an engine (as you are explicity testing certain engine behavior)

@mcwitt
Copy link
Contributor Author

mcwitt commented Mar 22, 2014

I think you can do the translation, and just check that the c-engine output is == to the python-engine output (you will have to do this in a separate test class probably).

OK, maybe TestCompareParsers or something similar?

There are several tests in TestParsers (i.e. which are run for all engines) that have C-unsupported options e.g. skip_footer>0 and consequently are just testing the python engine multiple times (meanwhile triggering the new warnings). Can these tests be moved to TestPythonParser?

The test test_read_table_buglet_4x_multiindex() has sep='\s+' but fails with the C parser after translating to delim_whitespace=True, so I think this is exposing a flaw in the C parser. This could also be moved to TestPythonParser for the time being and in the future we could make a new issue of it.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2014

go ahead and create TestCompareParsers

if u can take tests that currently just fallback (eg the skip footer one), put a copy of the test that falls back and assert for the parser warning - so when the issue is eventually fixed that test will no longer produce a warning and will then fail (alerting the person changing the code)

you can put a big comment around the test as well.

what I find is that I insert code to fix an issue rather than trying to find if their is an existing test for the issue (which there usually is not ) - so this will provide essentially a nice alert

@jreback
Copy link
Contributor

jreback commented Mar 23, 2014

so bottom line is that a test that currently falls back on a failed option should be

moved to python parser
a copy put in c parser but with an assertion that it produces a warning and a big comment that links the 2 tests (as eventually when the issue is fixed they should be merged back into 1 test that is run in multiple engines)

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

@mcwitt how's this coming?

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 9, 2014

Sorry for the delay. The warnings and translation of sep='\s+' to
delim_whitespace=True is finished. I'm working on the last few broken
tests, copying tests that only test the Python parser to TestPythonParser
and asserting that they produce a warning when run under ParserTests. So
all that's left I think is to finish the job with the tests and create
TestCompareParsers with a test case to make sure the translation works.

On Wed, Apr 9, 2014 at 1:03 PM, jreback notifications@github.com wrote:

@mcwitt https://github.com/mcwitt how's this coming?

Reply to this email directly or view it on GitHubhttps://github.com//issues/6607#issuecomment-40009848
.

@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

great! this would definintly be a welcome addition!

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 9, 2014

OK, all of the tests are passing except for several that cause an issue with the C parser:

test_read_table_buglet_4x_multiindex (test_parsers.ParserTests)
test_round_trip_frame (test_clipboard.TestClipboard)
test_round_trip_frame_string (test_clipboard.TestClipboard)

The failures seem to be caused by the C parser adding an extra column when lines begin with spaces. For example

In [3]: data = ' a b c\n 1 2 3\n 4 5 6'

In [4]: pd.read_table(StringIO(data), engine='c', delim_whitespace=True)
Out[4]: 
   Unnamed: 0  a  b  c
0         NaN  1  2  3
1         NaN  4  5  6

[2 rows x 4 columns]

Previously these tests fell back to the python parser (due to sep='\s+' being passed) which correctly ignores leading spaces:

In [5]: pd.read_table(StringIO(data), engine='python', sep='\s+')
Out[5]: 
   a  b  c
0  1  2  3
1  4  5  6

[2 rows x 3 columns]

@jreback
Copy link
Contributor

jreback commented Apr 10, 2014

hmm. seem that the c parser is wrong there (though of course their maybe some that works around this wrongness!).

it makes more sense to do what the python parser is doing. are the tests actually different? or is it the c-parser?

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 10, 2014

Looks like #3374 is relevant.

are the tests actually different? or is it the c-parser?

Not sure what you mean... The tests expect the behavior of the python parser because previously they fell back to python due to sep='\s+'. Now that we're translating this to delim_whitespace=True the result is the unexpected behavior, hence the tests failing.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2014

yep...looks like #3374 should be closed by you as well.

ok... can you fix the c-parser then?

(I meant that sometimes you know that something is broken so you write a test that checks the broken behavior), sort of wrong but you need a test, so it happens.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2014

and definitiely need to raise of BOTH delim_whitespace and sep are specified (I would say raise EVEN if sep is '\s+', because its really an error.

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 10, 2014

ok... can you fix the c-parser then?

I'll look at this.

and definitiely need to raise of BOTH delim_whitespace and sep are specified (I would say raise EVEN if sep is '\s+', because its really an error.

Agreed. I think a ValueError is appropriate in these cases?

@jreback
Copy link
Contributor

jreback commented Apr 10, 2014

yep

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 16, 2014

I realize this looks like a mess... unfortunately I don't know how it can be made any cleaner. Copying tests that fall thru to python from ParserTests to TestPythonParser (and asserting that they produce a ParserWarning when run under the C engine) account for the vast majority of the changes, since there is now some duplication.

Maybe it would be best to remove these tests from ParserTests for now, leaving just a comment? I have added a dedicated test that a ParserWarning is produced under the right circumstances.

@jreback jreback modified the milestones: 0.14.0, 0.15.0 Apr 21, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants