Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: add dtype= option to python parser #14295

Merged
merged 24 commits into from
Nov 26, 2016

Conversation

chris-b1
Copy link
Contributor

@chris-b1 chris-b1 commented Sep 24, 2016

Ultimately I'm working towards #8212 (types in excel parser), which should be pretty straightforward after this.

Right now the tests are moved from c_parser_only.py, may need to add some more
cc @gfyoung

@codecov-io
Copy link

codecov-io commented Sep 25, 2016

Current coverage is 85.22% (diff: 100%)

Merging #14295 into master will increase coverage by 0.01%

@@             master     #14295   diff @@
==========================================
  Files           143        143          
  Lines         50804      50833    +29   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43292      43324    +32   
+ Misses         7512       7509     -3   
  Partials          0          0          

Powered by Codecov. Last update 75b606a...3abb0bd

@jreback jreback added IO CSV read_csv, to_csv Dtype Conversions Unexpected or buggy dtype conversions labels Sep 26, 2016
result[c] = cvals
if verbose and na_count:
print('Filled %d NA values in column %s' % (na_count, str(c)))
return result

def _convert_types(self, values, na_values, try_num_bool=True):
def _infer_types(self, values, na_values, try_num_bool=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you are at it, can you add a docstring here?

not interpret dtype.
Use `str` or `object` to preserve and not interpret dtype.
If converters are specified, they will be applied AFTER
dtype conversion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test that exercises this assumption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong, the c-parser actually ignores the dtype and just uses the converter - changed docstring and added matching test.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice to me! (added two minor comments)

values, conv_f, na_values,
col_na_values, col_na_fvalues)
else:
try_num_bool = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just this:

try_num_bool = cast_type and is_object_dtype(cast_type)

@@ -1355,6 +1392,23 @@ def _convert_types(self, values, na_values, try_num_bool=True):

return result, na_count

def _cast_types(self, values, cast_type, column):
""" cast column to type specified in dtypes= param """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I know this isn't done too much for internal functions, it would be good to start documenting these functions more thoroughly for development purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I know this isn't done too much for internal functions,

But we should do it more! (therefore my comment above) It makes familiarizing yourself with a module a lot easier.
BTW, a PR to go through a file and document some of the functions (eg things you haven been working on in recent PRs and so you know better) is always very welcome :-)

@jorisvandenbossche
Copy link
Member

@chris-b1 xref #14558. You say that dtype is ignored when converters is specified. In that case it would be nice to raise a warning about this if both are specified I think.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small comments, looks good to me!


.. ipython:: python

from io import StringIO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not work on python 2. Should check how it is done in other places in the docs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from pandas.compat import StringIO is imported in a suppressed ipython code block at the top of the file, so you can just use StringIO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, the io module was backported to 2.6, but I will remove this, as it is already imported like you mentioned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-b1 That's true, but then the input must be unicode strings, not plain python 2 strings (which I think is not the case in the docs, but not sure)

If converters are specified, they will be applied INSTEAD
of dtype conversion.

.. versionadded:: 0.20.0 support for the Python parser.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would leave this out of the docstring (this is already so long ..), but not strong feelings, your take

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Nov 11, 2016
@chris-b1 chris-b1 changed the title API: add dtype= option to python parser (WIP) API: add dtype= option to python parser Nov 13, 2016
@chris-b1
Copy link
Contributor Author

I think I've got all the feedback so far worked in - any more comments? @jorisvandenbossche, @gfyoung, @jreback

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as well

Specifying ``dtype`` with ``engine`` other than 'c' raises a
``ValueError``.
.. versionadded:: 0.20.0 support for the Python parser.
The ``dtype`` option is supported by the 'python' engine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need a blank line here (to avoid warnings)

@@ -32,6 +32,14 @@ Other enhancements

- ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would make a sub-section

if conv:
if col_dtype is not None:
warnings.warn(("Both a converter and dtype were specified "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a test for this? (IOW, this was prob tested for c-engine, but now need to test for all)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this was added here:e0d1606

@gfyoung
Copy link
Member

gfyoung commented Nov 22, 2016

@chris-b1 : I'm putting together a PR to patch #14712. As this directly relates to your PR, I would suggest waiting for mine to make sure the bug isn't an issue for you.

@jorisvandenbossche
Copy link
Member

@chris-b1 can you rebase this? Then this can be merged I think (depending on the status of @gfyoung new PR)

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.2, 0.20.0 Nov 24, 2016
@jorisvandenbossche
Copy link
Member

@chris-b1 I merged #14717. This added a dtype test to c_parser_only, so added a commit to move this to the general dtype tests as well. Feel free to merge if you're ok with this and the tests are green.


The ``dtype`` keyword argument in the :func:`read_csv` function for specifying the types of parsed columns
is now supported with the ``'python'`` engine. See the :ref:`io docs <io.dtypes>` for more information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue references here

@jorisvandenbossche jorisvandenbossche merged commit 75bb530 into pandas-dev:master Nov 26, 2016
@jorisvandenbossche
Copy link
Member

@chris-b1 Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants