
ENH: support decimal option in PythonParser #12933 #13189

Closed · wants to merge 10 commits from camilocot/12933 into master
Conversation

ccronca (Contributor) commented May 15, 2016

return lines

if self.thousands is None:
    nonnum = re.compile('[^-^0-9^%s]+' % self.decimal)
Review comment (Contributor): these should be created in __init__
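A minimal sketch of what this review comment is asking for, assuming a simplified PythonParser-like class; everything except the `decimal`/`thousands` attributes and the regex itself is illustrative, not the actual pandas internals:

```python
import re

class PythonParserSketch(object):
    """Illustrative only -- not the real pandas PythonParser."""

    def __init__(self, decimal='.', thousands=None):
        self.decimal = decimal
        self.thousands = thousands
        # Compile once here instead of on every call to the line-processing
        # helper, as the reviewer suggests.
        self.nonnum = re.compile('[^-^0-9^%s]+' % self.decimal)

    def _is_number_like(self, value):
        # A field is treated as numeric if it contains nothing outside
        # digits, '-', and the configured decimal marker.
        return not self.nonnum.search(value)
```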

@jreback jreback added Enhancement IO CSV read_csv, to_csv labels May 16, 2016
@jreback jreback added this to the 0.18.2 milestone May 16, 2016
codecov-io commented May 16, 2016

Current coverage is 84.16%

Merging #13189 into master will decrease coverage by <.01%

@@             master     #13189   diff @@
==========================================
  Files           138        138          
  Lines         50496      50404    -92   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          42501      42418    -83   
+ Misses         7995       7986     -9   
  Partials          0          0          

Powered by Codecov. Last updated by b88eb35...dc8ca62

jreback (Contributor) commented May 16, 2016

can you post a quick benchmark on how this differs from current (with the python engine; you will have to actually use the decimal as the sep, otherwise it won't be a good comparison), just for the sake of completeness.

@@ -38,6 +38,8 @@ Other enhancements
idx = pd.Index(["a1a2", "b1", "c1"])
idx.str.extractall("[ab](?P<digit>\d)")

- Support decimal option in PythonParser (:issue:`12933`)
Review comment (Contributor): use double backticks; so something like:

``pd.read_csv()`` with ``engine='python'`` gained support for the ``decimal`` option.
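For context, the behavior this whatsnew entry describes looks roughly like the following (hypothetical sample data, not from the PR's test suite):

```python
from io import StringIO
import pandas as pd

# Semicolon-separated data that uses ',' as the decimal marker.
data = "a;b\n1,5;2,25\n3,75;4,0\n"

# Before this PR, decimal=',' was only honoured by the C engine;
# with it, the Python engine parses these columns as floats too.
df = pd.read_csv(StringIO(data), sep=';', decimal=',', engine='python')
print(df.dtypes)  # a    float64
                  # b    float64
```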

jreback (Contributor) commented May 16, 2016

minor comments. pls squash; ping when green.

@gfyoung any comments?

gfyoung (Member) commented May 16, 2016

@camilocot :

  1. Awesome that you got this to work!
  2. Can you check this test and see what you get for the Python engine now? The test (or the comment at least) should be changed. Otherwise, LGTM.

ccronca (Contributor, Author) commented May 16, 2016

@jreback regarding the benchmark, should I read some asv documentation before posting it, or is a %timeit enough?

gfyoung (Member) commented May 16, 2016

Read the asv documentation. Showing the results of those benchmark tests will be a lot more convincing to get this merged in.

gfyoung (Member) commented May 16, 2016

Also, I looked at the test again that I asked you to change (with the comment), and I am unfortunately no longer convinced that a comment change is sufficient. Can you change the self.assertRaises to tm.assertRaisesRegexp(ValueError, <errmsg>, self.read_csv, StringIO(data), decimal='')? That will be stronger testing-wise. Thanks!
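A rough sketch of the requested test shape (the error message and sample data here are placeholders, not necessarily the exact strings used in the pandas test suite):

```python
import pandas.util.testing as tm
from pandas.compat import StringIO  # pandas 0.18.x-era import

def test_empty_decimal_marker(self):
    data = """A|B|C
1|2,334|5
10|13|10.
"""
    # Placeholder message -- the real test should match whatever
    # ValueError text the parsers actually raise for decimal=''.
    msg = 'Only length-1 decimal markers supported'
    tm.assertRaisesRegexp(ValueError, msg, self.read_csv,
                          StringIO(data), decimal='')
```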

jreback (Contributor) commented May 16, 2016

yeah you can add to the asv benchmarks. I don't expect them to change significantly (as this is a special passed option), but it is nice to add a benchmark (and show the results here).
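A minimal sketch of what such an asv benchmark could look like (class and method names are illustrative; the benchmarks actually added in this PR live in the existing parser_vb benchmark module and use a fixed data string rather than random values):

```python
import random
from io import StringIO

import pandas as pd


class TimeReadCSVDecimalPythonEngine(object):
    # Illustrative asv-style benchmark, not the one committed in this PR.

    def setup(self):
        random.seed(42)
        # ';'-separated rows using ',' as the decimal marker.
        rows = ['%s;%s' % (str(random.random()).replace('.', ','),
                           str(random.random()).replace('.', ','))
                for _ in range(3000)]
        self.data = '\n'.join(rows)

    def time_read_csv_decimal_python_engine(self):
        pd.read_csv(StringIO(self.data), sep=';', header=None,
                    decimal=',', engine='python')
```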

ccronca (Contributor, Author) commented May 18, 2016

@jreback: asv output of two new benchmarks using python engine:

▶ asv continuous master 12933  -b parser_vb.read_csv_pyth 
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash dc8ca622:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..............................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter                                                                                                         2.86ms
[ 50.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter_with_decimal                                                                                            9.47ms
[ 50.00%] · For pandas commit hash 86f68e6a:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter                                                                                                         2.87ms
[100.00%] ··· Running parser_vb.read_csv_python_engine.time_read_csv_default_converter_with_decimal                                                                                            failed
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@gfyoung the test assert was fixed, could you please take a look?

gfyoung (Member) commented May 18, 2016

@camilocot: Test assertion fix LGTM. Thanks!


def setup(self):
    self.data_decimal = '0,1213700904466425978256438611;0,0525708283766902484401839501;0,4174092731488769913994474336\n 0,4096341697147408700274695547;0,1587830198973579909349496119;0,1292545832485494372576795285\n 0,8323255650024565799327547210;0,9694902427379478160318626578;0,6295047811546814475747169126\n 0,4679375305798131323697930383;0,2963942381834381301075609371;0,5268936082160610157032465394\n 0,6685382761849776311890991564;0,6721207066140679753374342908;0,6519975277021627935170045020\n '
Review comment (Contributor): do these benchmarks exist for the c-engine as well? (ideally we would use the same exact data so we can compare)

Review reply (Contributor, Author): @jreback, added one benchmark with the same data for the c-engine. Here are the results:

▶ asv continuous master 12933  -b parser_vb.read_csv_default 
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 8 total benchmarks (2 commits * 1 environments * 4 benchmarks)
[  0.00%] · For pandas commit hash 465272e1:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...............................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Running parser_vb.read_csv_default_converter.time_read_csv_default_converter                                                                                                     1.99ms
[ 25.00%] ··· Running parser_vb.read_csv_default_converter_python_engine.time_read_csv_default_converter                                                                                       2.87ms
[ 37.50%] ··· Running parser_vb.read_csv_default_converter_with_decimal.time_read_csv_default_converter_with_decimal                                                                           2.00ms
[ 50.00%] ··· Running parser_vb.read_csv_default_converter_with_decimal_python_engine.time_read_csv_default_converter_with_decimal                                                             9.24ms
[ 50.00%] · For pandas commit hash 86f68e6a:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Running parser_vb.read_csv_default_converter.time_read_csv_default_converter                                                                                                     1.97ms
[ 75.00%] ··· Running parser_vb.read_csv_default_converter_python_engine.time_read_csv_default_converter                                                                                       2.87ms
[ 87.50%] ··· Running parser_vb.read_csv_default_converter_with_decimal.time_read_csv_default_converter_with_decimal                                                                           1.99ms
[100.00%] ··· Running parser_vb.read_csv_default_converter_with_decimal_python_engine.time_read_csv_default_converter_with_decimal                                                             failed
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback jreback closed this in 19ebee5 May 22, 2016
jreback (Contributor) commented May 22, 2016

thanks @camilocot

RahulHP pushed a commit to RahulHP/pandas that referenced this pull request May 27, 2016
closes pandas-dev#12933

Author: Camilo Cota <ccota@riplife.es>

Closes pandas-dev#13189 from camilocot/12933 and squashes the following commits:

465272e [Camilo Cota] Benchmark decimal option in read_csv for c engine
9f42d0c [Camilo Cota] double backticks around decimal and engine='python'
dc8ca62 [Camilo Cota] fix test_empty_decimal_marker comment
49613fe [Camilo Cota] Assert read_csv error message in test_empty_decimal_marker
d821052 [Camilo Cota] fix test_empty_decimal_marker comment
f71509d [Camilo Cota] Include descritive what's new line
803356e [Camilo Cota] set nonnum regex in init method
1472d80 [Camilo Cota] Include the issue number in what's new
b560fda [Camilo Cota] Fix what's new
dc7acd1 [Camilo Cota] ENH: support decimal option in PythonParser pandas-dev#12933
RahulHP added a commit to RahulHP/pandas that referenced this pull request May 28, 2016
ENH: support decimal option in PythonParser pandas-dev#12933

Labels: Enhancement, IO CSV (read_csv, to_csv)
Projects: None yet
Development: Successfully merging this pull request may close these issues: ENH: support decimal option in PythonParser
4 participants