Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH13347 ENH Allow true_values and false_values options in read_excel #14002

Merged

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Aug 15, 2016

@@ -373,6 +388,12 @@ def _parse_cell(cell_contents, cell_typ):
val = int(cell_contents)
if val == cell_contents:
cell_contents = val
elif true_values and cell_contents in true_values:
# GH13347
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are going to need to make a set of these if they r not none

eg this will break with a scalar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I'm clear you are asking to change it from:
elif true_values and cell_contents in true_values:
to
elif true_values and cell_contents in set(true_values):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can try but that will fail as well

if true_values=5 or whatever

@codecov-io
Copy link

codecov-io commented Aug 15, 2016

Current coverage is 85.24% (diff: 100%)

Merging #14002 into master will decrease coverage by <.01%

@@             master     #14002   diff @@
==========================================
  Files           140        140          
  Lines         50563      50563          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43103      43102     -1   
- Misses         7460       7461     +1   
  Partials          0          0          

Powered by Codecov. Last update 54ab5be...15ec67f

@jreback
Copy link
Contributor

jreback commented Aug 15, 2016

@WillAyd ok, the issue is for some other things like this we accept scalar and lists. We need validation on this (in the parser code) to turn a scalar to a list (and a test for this).

@WillAyd
Copy link
Member Author

WillAyd commented Aug 15, 2016

Thanks Jeff. I did however notice that the keyword arguments in read_csv don't do anything to turn scalars into a list. For instance, if I had a line in test.csv that was simply "foo,bar" then running the below would not convert the record foo to a bool.

df = pd.read_csv('test.csv', true_values='foo')

I have to specify the argument as:

df = pd.read_csv('test.csv', true_values=['foo'])

to make it work. I am thinking of two options:

  1. Add the aforementioned validation to read_excel or
  2. Have read_excel call the same function for bool checking as read_csv (maybe_convert_bool from pandas.lib)

The former would obviously fix the problem at hand but make the two read functions behave slightly differently. The second one would keep them consistent but require more work or an additional change to add the validation you are describing. Let me know which route you prefer.

@jreback
Copy link
Contributor

jreback commented Aug 15, 2016

they should behave the same. You can for now mimic read_csv and we can come back in a later issue and fix this (as its a bug in read_csv). Or could do it all here.

@WillAyd WillAyd force-pushed the true-false-excel-options branch from da3367f to e3d0849 Compare August 17, 2016 00:53
@WillAyd
Copy link
Member Author

WillAyd commented Aug 17, 2016

Latest commit should have fixed this to add the appropriate keyword arguments of true_values and false_values to the TextParser object referenced in the excel.py file. This keeps it consistent with read_csv. Updated asv_bench screenshot attached
asv_read_excel

@@ -412,6 +412,7 @@ Other enhancements

- Added documentation to :ref:`I/O<io.dtypes>` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`)
- Raise ImportError for in the sql functions when sqlalchemy is not installed and a connection string is used (:issue:`11920`).
- :func:`read_excel` now supports the true_values and false_values keyword arguments (:issue: `13347`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double back ticks around true_values, false_values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no space (:issue:13347)

@WillAyd WillAyd force-pushed the true-false-excel-options branch from e3d0849 to 49be012 Compare August 17, 2016 22:38
@WillAyd
Copy link
Member Author

WillAyd commented Aug 17, 2016

Thanks for the callout on mixing strings / floats, etc... into the testing scenario. I realized my code isn't working when you mix in floats into the file and seems to be replacing them with np.nan. I'll dig further back in and try to see what's going on

@jreback
Copy link
Contributor

jreback commented Sep 9, 2016

can you rebase / update?

closes pandas-dev#13347

Author: willayd <william.ayd@gmail.com>

Squashes the following commits:

242a6a6 ENH: Allow true_values and fals_values options in read_excel
@WillAyd WillAyd force-pushed the true-false-excel-options branch from 49be012 to 15ec67f Compare September 12, 2016 01:29
@WillAyd
Copy link
Member Author

WillAyd commented Sep 12, 2016

Rebase should now be complete

@jorisvandenbossche jorisvandenbossche added this to the 0.19.0 milestone Sep 13, 2016
@jorisvandenbossche jorisvandenbossche merged commit cebc70c into pandas-dev:master Sep 13, 2016
@jorisvandenbossche
Copy link
Member

@WillAyd Thanks!

@adrivsh
Copy link

adrivsh commented Sep 14, 2016

@WillAyd, note that 'foo' is not a scalar, but a string, which is a list of characters. I think your code converts 'f' and 'o' to booleans
Just flagging this in case it's useful

@jreback
Copy link
Contributor

jreback commented Sep 14, 2016

@adrivsh you would have to show an example
strings are normally treated similarly to scalars

@adrivsh
Copy link

adrivsh commented Sep 14, 2016

The example is

import pandas as pd
pd.DataFrame(["f","o","o"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")

The result is:

Unnamed: 0 0
0 0 False
1 1 False
2 2 False

What i am saying is that python duck-types strings as list of characters, so that false_values= "foo" and false_values = ["f","o","o"] will produce the same result, unless some code prevents that

@adrivsh
Copy link

adrivsh commented Sep 14, 2016

Interestingly,

import pandas as pd
pd.DataFrame(["f","o","o","b","a","r"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")

does not transform "f"s and "o"s to False, so I actually have no idea what i am talking about.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 15, 2016

@adrivsh I think that is possibly correct, but is a bit a pitfall (and is not documented very well). The string is regarded as a list of characters, if you pass it a single element list it works as expected:

In [15]: pd.read_csv("foo.csv", false_values=["foo"])
Out[15]: 
   Unnamed: 0  0
0           0  f
1           1  o
2           2  o

@adrivsh do you want to open a new issue for that?

yarikoptic added a commit to neurodebian/pandas that referenced this pull request Oct 13, 2016
* commit 'v0.19.0rc1-25-ga7469cf': (471 commits)
  ENH: Add divmod to series and index. (pandas-dev#14208)
  Fix generator tests to run (pandas-dev#14245)
  BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225)
  TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240)
  DOC: added example to Series.map showing use of na_action parameter (GH14231)
  DOC: split docstring into multiple lines in excel.py (pandas-dev#14073)
  MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181)
  ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002)
  DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211)
  BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797
  BUG: add check for infinity in __call__ of EngFormatter
  In gbq.to_gbq allow the DataFrame column order to differ from schema
  BLD: require cython if tempita is needed
  DOC: add source links to api docs (pandas-dev#14200)
  BUG: compat with Stata ver 111
  Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198)
  BLD: Test for Python 3.5 with C locale
  BUG: DatetimeTZBlock can't assign values near dst boundary
  BUG: union_categorical with Series and cat idx
  BUG: fix str.contains for series containing only nan values
  ...
@WillAyd WillAyd deleted the true-false-excel-options branch December 12, 2017 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow true_values and false_values options in read_excel
5 participants