
ENH: support decimal option in PythonParser #12933 #13189

Closed · wants to merge 10 commits from camilocot/12933 into master
Conversation

ccronca (Contributor) commented May 15, 2016

return lines

if self.thousands is None:
    nonnum = re.compile('[^-^0-9^%s]+' % self.decimal)
Review comment (Contributor): these should be created in __init__
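A minimal sketch of what this review comment is asking for, assuming a simplified PythonParser-like class; everything except the `decimal`/`thousands` attributes and the regex itself is illustrative, not the actual pandas internals:

```python
import re

class PythonParserSketch(object):
    """Illustrative only -- not the real pandas PythonParser."""

    def __init__(self, decimal='.', thousands=None):
        self.decimal = decimal
        self.thousands = thousands
        # Compile once here instead of on every call to the line-processing
        # helper, as the reviewer suggests.
        self.nonnum = re.compile('[^-^0-9^%s]+' % self.decimal)

    def _is_number_like(self, value):
        # A field is treated as numeric if it contains nothing outside
        # digits, '-', and the configured decimal marker.
        return not self.nonnum.search(value)
```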

@jreback jreback added Enhancement IO CSV read_csv, to_csv labels May 16, 2016
@jreback jreback added this to the 0.18.2 milestone May 16, 2016
codecov-io commented May 16, 2016

Current coverage is 84.16%

Merging #13189 into master will decrease coverage by <.01%

@@             master     #13189   diff @@
==========================================
  Files           138        138          
  Lines         50496      50404    -92   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          42501      42418    -83   
+ Misses         7995       7986     -9   
  Partials          0          0          

Powered by Codecov. Last updated by b88eb35...dc8ca62

jreback (Contributor) commented May 16, 2016

can you post a quick benchmark on how this differs from current (with the python engine; you will have to actually use the decimal as the sep, otherwise it won't be a good comparison), just for the sake of completeness.

@@ -38,6 +38,8 @@ Other enhancements
idx = pd.Index(["a1a2", "b1", "c1"])
idx.str.extractall("[ab](?P<digit>\d)")

- Support decimal option in PythonParser (:issue:`12933`)
Review comment (Contributor): use double backticks; so something like:

``pd.read_csv()`` with ``engine='python'`` gained support for the ``decimal`` option.
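For context, the behavior this whatsnew entry describes looks roughly like the following (hypothetical sample data, not from the PR's test suite):

```python
from io import StringIO
import pandas as pd

# Semicolon-separated data that uses ',' as the decimal marker.
data = "a;b\n1,5;2,25\n3,75;4,0\n"

# Before this PR, decimal=',' was only honoured by the C engine;
# with it, the Python engine parses these columns as floats too.
df = pd.read_csv(StringIO(data), sep=';', decimal=',', engine='python')
print(df.dtypes)  # a    float64
                  # b    float64
```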

jreback (Contributor) commented May 16, 2016

minor comments. pls squash; ping when green.

@gfyoung any comments?

gfyoung (Member) commented May 16, 2016

@camilocot :

  1. Awesome that you got this to work!
  2. Can you check this test and see what you get for the Python engine now? The test (or the comment at least) should be changed. Otherwise, LGTM.

ccronca (Contributor, Author) commented May 16, 2016

@jreback regarding the benchmark, should I read some asv documentation before posting it, or is a %timeit enough?

gfyoung (Member) commented May 16, 2016

Read the asv documentation. Showing the results of those benchmark tests will be a lot more convincing to get this merged in.

gfyoung (Member) commented May 16, 2016

Also, I looked at the test again that I asked you to change (with the comment), and I am unfortunately no longer convinced that a comment change is sufficient. Can you change the self.assertRaises to tm.assertRaisesRegexp(ValueError, <errmsg>, self.read_csv, StringIO(data), decimal='')? That will be stronger testing-wise. Thanks!
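A rough sketch of the requested test shape (the error message and sample data here are placeholders, not necessarily the exact strings used in the pandas test suite):

```python
import pandas.util.testing as tm
from pandas.compat import StringIO  # pandas 0.18.x-era import

def test_empty_decimal_marker(self):
    data = """A|B|C
1|2,334|5
10|13|10.
"""
    # Placeholder message -- the real test should match whatever
    # ValueError text the parsers actually raise for decimal=''.
    msg = 'Only length-1 decimal markers supported'
    tm.assertRaisesRegexp(ValueError, msg, self.read_csv,
                          StringIO(data), decimal='')
```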

jreback (Contributor) commented May 16, 2016

yeah you can add to the asv benchmarks. I don't expect them to change significantly (as this is a special passed option), but it is nice to add a benchmark (and show the results here).
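A minimal sketch of what such an asv benchmark could look like (class and method names are illustrative; the benchmarks actually added in this PR live in the existing parser_vb benchmark module and use a fixed data string rather than random values):

```python
import random
from io import StringIO

import pandas as pd


class TimeReadCSVDecimalPythonEngine(object):
    # Illustrative asv-style benchmark, not the one committed in this PR.

    def setup(self):
        random.seed(42)
        # ';'-separated rows using ',' as the decimal marker.
        rows = ['%s;%s' % (str(random.random()).replace('.', ','),
                           str(random.random()).replace('.', ','))
                for _ in range(3000)]
        self.data = '\n'.join(rows)

    def time_read_csv_decimal_python_engine(self):
        pd.read_csv(StringIO(self.data), sep=';', header=None,
                    decimal=',', engine='python')
```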

ccronca (Contributor, Author) commented May 18, 2016

@jreback: asv output of two new benchmarks using python engine:

▶ asv continuous master 12933  -b parser_vb.read_csv_pyth 
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash dc8ca622:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..............................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter                                                                                                         2.86ms
[ 50.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter_with_decimal                                                                                            9.47ms
[ 50.00%] · For pandas commit hash 86f68e6a:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter                                                                                                         2.87ms
[100.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter_with_decimal                                                                                            failed
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@gfyoung the test assert was fixed, could you please take a look?

gfyoung (Member) commented May 18, 2016

@camilocot: Test assertion fix LGTM. Thanks!


def setup(self):
    self.data_decimal = '0,1213700904466425978256438611;0,0525708283766902484401839501;0,4174092731488769913994474336\n 0,4096341697147408700274695547;0,1587830198973579909349496119;0,1292545832485494372576795285\n 0,8323255650024565799327547210;0,9694902427379478160318626578;0,6295047811546814475747169126\n 0,4679375305798131323697930383;0,2963942381834381301075609371;0,5268936082160610157032465394\n 0,6685382761849776311890991564;0,6721207066140679753374342908;0,6519975277021627935170045020\n '
Review comment (Contributor): do these benchmarks exist for the c-engine as well? (ideally we would use the same exact data so we can compare)

Review reply (Contributor, Author): @jreback, added one benchmark with the same data for the c-engine. Here are the results:

▶ asv continuous master 12933  -b parser_vb.read_csv_default 
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 8 total benchmarks (2 commits * 1 environments * 4 benchmarks)
[  0.00%] · For pandas commit hash 465272e1:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...............................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Running parser_vb.read_csv_default_converter.time_read_csv_default_converter                                                                                                     1.99ms
[ 25.00%] ··· Running parser_vb.read_csv_default_converter_python_engine.time_read_csv_default_converter                                                                                       2.87ms
[ 37.50%] ··· Running parser_vb.read_csv_default_converter_with_decimal.time_read_csv_default_converter_with_decimal                                                                           2.00ms
[ 50.00%] ··· Running parser_vb.read_csv_default_converter_with_decimal_python_engine.time_read_csv_default_converter_with_decimal                                                             9.24ms
[ 50.00%] · For pandas commit hash 86f68e6a:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Running parser_vb.read_csv_default_converter.time_read_csv_default_converter                                                                                                     1.97ms
[ 75.00%] ··· Running parser_vb.read_csv_default_converter_python_engine.time_read_csv_default_converter                                                                                       2.87ms
[ 87.50%] ··· Running parser_vb.read_csv_default_converter_with_decimal.time_read_csv_default_converter_with_decimal                                                                           1.99ms
[100.00%] ··· Running parser_vb.read_csv_default_converter_with_decimal_python_engine.time_read_csv_default_converter_with_decimal                                                             failed
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback jreback closed this in 19ebee5 May 22, 2016
jreback (Contributor) commented May 22, 2016

thanks @camilocot

RahulHP pushed a commit to RahulHP/pandas that referenced this pull request May 27, 2016
closes pandas-dev#12933

Author: Camilo Cota <ccota@riplife.es>

Closes pandas-dev#13189 from camilocot/12933 and squashes the following commits:

465272e [Camilo Cota] Benchmark decimal option in read_csv for c engine
9f42d0c [Camilo Cota] double backticks around decimal and engine='python'
dc8ca62 [Camilo Cota] fix test_empty_decimal_marker comment
49613fe [Camilo Cota] Assert read_csv error message in test_empty_decimal_marker
d821052 [Camilo Cota] fix test_empty_decimal_marker comment
f71509d [Camilo Cota] Include descritive what's new line
803356e [Camilo Cota] set nonnum regex in init method
1472d80 [Camilo Cota] Include the issue number in what's new
b560fda [Camilo Cota] Fix what's new
dc7acd1 [Camilo Cota] ENH: support decimal option in PythonParser pandas-dev#12933
RahulHP added a commit to RahulHP/pandas that referenced this pull request May 28, 2016
ENH: support decimal option in PythonParser pandas-dev#12933

Labels: Enhancement, IO CSV (read_csv, to_csv)
Projects: None yet
Development: Successfully merging this pull request may close these issues: ENH: support decimal option in PythonParser
4 participants