GH13347 ENH Allow true_values and false_values options in read_excel #14002

WillAyd · 2016-08-15T00:47:17Z

[ X] closes Allow true_values and false_values options in read_excel #13347
[ X] tests added / passed
[ X] passes git diff upstream/master | flake8 --diff
[ X] whatsnew entry

jreback · 2016-08-15T00:49:49Z

pandas/io/excel.py

@@ -373,6 +388,12 @@ def _parse_cell(cell_contents, cell_typ):
                val = int(cell_contents)
                if val == cell_contents:
                    cell_contents = val
+            elif true_values and cell_contents in true_values:
+                # GH13347


you are going to need to make a set of these if they r not none

eg this will break with a scalar

Just so I'm clear you are asking to change it from:
elif true_values and cell_contents in true_values:
to
elif true_values and cell_contents in set(true_values):

you can try but that will fail as well

if true_values=5 or whatever

codecov-io · 2016-08-15T01:20:30Z

Current coverage is 85.24% (diff: 100%)

Merging #14002 into master will decrease coverage by <.01%

@@             master     #14002   diff @@
==========================================
  Files           140        140          
  Lines         50563      50563          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43103      43102     -1   
- Misses         7460       7461     +1   
  Partials          0          0

Powered by Codecov. Last update 54ab5be...15ec67f

jreback · 2016-08-15T11:34:55Z

@WillAyd ok, the issue is for some other things like this we accept scalar and lists. We need validation on this (in the parser code) to turn a scalar to a list (and a test for this).

WillAyd · 2016-08-15T22:46:37Z

Thanks Jeff. I did however notice that the keyword arguments in read_csv don't do anything to turn scalars into a list. For instance, if I had a line in test.csv that was simply "foo,bar" then running the below would not convert the record foo to a bool.

df = pd.read_csv('test.csv', true_values='foo')

I have to specify the argument as:

df = pd.read_csv('test.csv', true_values=['foo'])

to make it work. I am thinking of two options:

Add the aforementioned validation to read_excel or
Have read_excel call the same function for bool checking as read_csv (maybe_convert_bool from pandas.lib)

The former would obviously fix the problem at hand but make the two read functions behave slightly differently. The second one would keep them consistent but require more work or an additional change to add the validation you are describing. Let me know which route you prefer.

jreback · 2016-08-15T23:06:52Z

they should behave the same. You can for now mimic read_csv and we can come back in a later issue and fix this (as its a bug in read_csv). Or could do it all here.

WillAyd · 2016-08-17T00:57:54Z

Latest commit should have fixed this to add the appropriate keyword arguments of true_values and false_values to the TextParser object referenced in the excel.py file. This keeps it consistent with read_csv. Updated asv_bench screenshot attached

jreback · 2016-08-17T22:01:17Z

doc/source/whatsnew/v0.19.0.txt

@@ -412,6 +412,7 @@ Other enhancements

 - Added documentation to :ref:`I/O<io.dtypes>` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`)
 - Raise ImportError for in the sql functions when sqlalchemy is not installed and a connection string is used (:issue:`11920`).
+- :func:`read_excel` now supports the true_values and false_values keyword arguments (:issue: `13347`)


use double back ticks around true_values, false_values

no space (:issue:13347)

WillAyd · 2016-08-17T23:05:48Z

Thanks for the callout on mixing strings / floats, etc... into the testing scenario. I realized my code isn't working when you mix in floats into the file and seems to be replacing them with np.nan. I'll dig further back in and try to see what's going on

jreback · 2016-09-09T22:35:41Z

can you rebase / update?

closes pandas-dev#13347 Author: willayd <william.ayd@gmail.com> Squashes the following commits: 242a6a6 ENH: Allow true_values and fals_values options in read_excel

WillAyd · 2016-09-12T01:31:11Z

Rebase should now be complete

jorisvandenbossche · 2016-09-13T23:07:34Z

@WillAyd Thanks!

adrivsh · 2016-09-14T00:03:40Z

@WillAyd, note that 'foo' is not a scalar, but a string, which is a list of characters. I think your code converts 'f' and 'o' to booleans
Just flagging this in case it's useful

jreback · 2016-09-14T00:11:41Z

@adrivsh you would have to show an example
strings are normally treated similarly to scalars

adrivsh · 2016-09-14T16:21:23Z

The example is

import pandas as pd
pd.DataFrame(["f","o","o"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")

The result is:

Unnamed: 0 0
0 0 False
1 1 False
2 2 False

What i am saying is that python duck-types strings as list of characters, so that false_values= "foo" and false_values = ["f","o","o"] will produce the same result, unless some code prevents that

adrivsh · 2016-09-14T16:24:12Z

Interestingly,

import pandas as pd
pd.DataFrame(["f","o","o","b","a","r"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")

does not transform "f"s and "o"s to False, so I actually have no idea what i am talking about.

jorisvandenbossche · 2016-09-15T15:29:10Z

@adrivsh I think that is possibly correct, but is a bit a pitfall (and is not documented very well). The string is regarded as a list of characters, if you pass it a single element list it works as expected:

In [15]: pd.read_csv("foo.csv", false_values=["foo"])
Out[15]: 
   Unnamed: 0  0
0           0  f
1           1  o
2           2  o

@adrivsh do you want to open a new issue for that?

* commit 'v0.19.0rc1-25-ga7469cf': (471 commits) ENH: Add divmod to series and index. (pandas-dev#14208) Fix generator tests to run (pandas-dev#14245) BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225) TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240) DOC: added example to Series.map showing use of na_action parameter (GH14231) DOC: split docstring into multiple lines in excel.py (pandas-dev#14073) MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181) ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002) DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211) BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797 BUG: add check for infinity in __call__ of EngFormatter In gbq.to_gbq allow the DataFrame column order to differ from schema BLD: require cython if tempita is needed DOC: add source links to api docs (pandas-dev#14200) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values ...

jreback reviewed Aug 15, 2016
View reviewed changes

jorisvandenbossche added Enhancement IO Excel read_excel, to_excel labels Aug 15, 2016

WillAyd force-pushed the true-false-excel-options branch from da3367f to e3d0849 Compare August 17, 2016 00:53

jreback reviewed Aug 17, 2016
View reviewed changes

WillAyd force-pushed the true-false-excel-options branch from e3d0849 to 49be012 Compare August 17, 2016 22:38

ENH: Allow true_values and false_values options in read_excel

15ec67f

closes pandas-dev#13347 Author: willayd <william.ayd@gmail.com> Squashes the following commits: 242a6a6 ENH: Allow true_values and fals_values options in read_excel

WillAyd force-pushed the true-false-excel-options branch from 49be012 to 15ec67f Compare September 12, 2016 01:29

jorisvandenbossche added this to the 0.19.0 milestone Sep 13, 2016

jorisvandenbossche merged commit cebc70c into pandas-dev:master Sep 13, 2016

WillAyd deleted the true-false-excel-options branch December 12, 2017 15:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH13347 ENH Allow true_values and false_values options in read_excel #14002

GH13347 ENH Allow true_values and false_values options in read_excel #14002

WillAyd commented Aug 15, 2016

jreback Aug 15, 2016

WillAyd Aug 15, 2016

jreback Aug 15, 2016

codecov-io commented Aug 15, 2016 •

edited

Loading

jreback commented Aug 15, 2016

WillAyd commented Aug 15, 2016 •

edited

Loading

jreback commented Aug 15, 2016

WillAyd commented Aug 17, 2016 •

edited

Loading

jreback Aug 17, 2016

jreback Aug 17, 2016

WillAyd commented Aug 17, 2016

jreback commented Sep 9, 2016

WillAyd commented Sep 12, 2016

jorisvandenbossche commented Sep 13, 2016

adrivsh commented Sep 14, 2016 •

edited

Loading

jreback commented Sep 14, 2016

adrivsh commented Sep 14, 2016 •

edited

Loading

adrivsh commented Sep 14, 2016

jorisvandenbossche commented Sep 15, 2016 •

edited

Loading

GH13347 ENH Allow true_values and false_values options in read_excel #14002

GH13347 ENH Allow true_values and false_values options in read_excel #14002

Conversation

WillAyd commented Aug 15, 2016

jreback Aug 15, 2016

Choose a reason for hiding this comment

WillAyd Aug 15, 2016

Choose a reason for hiding this comment

jreback Aug 15, 2016

Choose a reason for hiding this comment

codecov-io commented Aug 15, 2016 • edited Loading

Current coverage is 85.24% (diff: 100%)

jreback commented Aug 15, 2016

WillAyd commented Aug 15, 2016 • edited Loading

jreback commented Aug 15, 2016

WillAyd commented Aug 17, 2016 • edited Loading

jreback Aug 17, 2016

Choose a reason for hiding this comment

jreback Aug 17, 2016

Choose a reason for hiding this comment

WillAyd commented Aug 17, 2016

jreback commented Sep 9, 2016

WillAyd commented Sep 12, 2016

jorisvandenbossche commented Sep 13, 2016

adrivsh commented Sep 14, 2016 • edited Loading

jreback commented Sep 14, 2016

adrivsh commented Sep 14, 2016 • edited Loading

adrivsh commented Sep 14, 2016

jorisvandenbossche commented Sep 15, 2016 • edited Loading

codecov-io commented Aug 15, 2016 •

edited

Loading

WillAyd commented Aug 15, 2016 •

edited

Loading

WillAyd commented Aug 17, 2016 •

edited

Loading

adrivsh commented Sep 14, 2016 •

edited

Loading

adrivsh commented Sep 14, 2016 •

edited

Loading

jorisvandenbossche commented Sep 15, 2016 •

edited

Loading