Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: ascii c string functions #23981

Merged
merged 9 commits into from
Dec 5, 2018
Merged

Conversation

chris-b1
Copy link
Contributor

@chris-b1 chris-b1 commented Nov 28, 2018

This ends up being kind of hairy - open to suggestions on a better fix. cc @dragoljub, @cgohlke

Timings

df = pd.DataFrame(np.random.randn(1000000, 10), 
    columns=('COL{}'.format(i) for i in range(10)))
df.to_csv('~/tmp.csv')

# python 3.6 - pandas 0.23.4
In [1]: %timeit pd.read_csv('~/tmp.csv')
4.55 s ± 344 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# python 3.7 - before
In [3]: %timeit pd.read_csv('~/tmp.csv')
21.8 s ± 769 ms per loop (mean ± std. dev. of 7 runs, 1 loop ea

# python 3.7 - after
In [1]: %timeit pd.read_csv('~/tmp.csv')
4.25 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@pep8speaks
Copy link

Hello @chris-b1! Thanks for submitting the PR.

@chris-b1 chris-b1 added Performance Memory or execution speed performance Windows Windows OS labels Nov 28, 2018
@jreback
Copy link
Contributor

jreback commented Nov 28, 2018

can you add a whatsnew note in perf

@dragoljub
Copy link

@chris-b1 great work! 👍 The simple fix always seems to add a few more layers of complexity.

Can someone build/try this on Linux to make sure there is no regression there? 😉

@jbrockmendel
Copy link
Member

Did tracking this problem down give any insight into whether/when we can de-duplicate parse_helper.xstrtod and tokenizer.xstrtod?

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

@chris-b1 I think you have some linting errors, can you also rebase on master & add a whatsnew

andas/_libs/src/parser/tokenizer.c:26:  At least two spaces is best between code and comments  [whitespace/comments] [2]

@chris-b1
Copy link
Contributor Author

chris-b1 commented Dec 3, 2018

@jbrockmendel - It will take a bit of refactoring to deal with the maybe_int out param - the parse_helper one needs its, the csv parsing one doesn't. Will have that as a follow-up, have #19361 on it.

@codecov
Copy link

codecov bot commented Dec 3, 2018

Codecov Report

Merging #23981 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23981   +/-   ##
=======================================
  Coverage   42.38%   42.38%           
=======================================
  Files         161      161           
  Lines       51701    51701           
=======================================
  Hits        21914    21914           
  Misses      29787    29787
Flag Coverage Δ
#single 42.38% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08395af...8af538f. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 3, 2018

Codecov Report

Merging #23981 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23981      +/-   ##
==========================================
- Coverage    92.2%    92.2%   -0.01%     
==========================================
  Files         162      162              
  Lines       51714    51714              
==========================================
- Hits        47682    47681       -1     
- Misses       4032     4033       +1
Flag Coverage Δ
#multiple 90.6% <ø> (ø) ⬆️
#single 43.01% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/json.py 92.61% <0%> (-0.48%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f2c716...5abc57d. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Dec 4, 2018

can you merge master and push again.

@jreback jreback added this to the 0.24.0 milestone Dec 4, 2018
@jreback
Copy link
Contributor

jreback commented Dec 4, 2018

@chris-b1 this will fail on 3.6 build (fix was just merged), but ping when ready otherwise.

@chris-b1
Copy link
Contributor Author

chris-b1 commented Dec 4, 2018

@jreback - other than CI issues, I think this is good to go?

@jreback
Copy link
Contributor

jreback commented Dec 4, 2018

yeah just waiting for a good ci run (don't rebase yet).

@jreback
Copy link
Contributor

jreback commented Dec 5, 2018

looks like ci is fixed up, can you rebase

@jreback jreback merged commit 8ea7744 into pandas-dev:master Dec 5, 2018
@jreback
Copy link
Contributor

jreback commented Dec 5, 2018

thanks @chris-b1

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 6, 2018
commit 28c61d770f6dfca6857fd0fa6979d4119a31129e
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:18:19 2018 -0600

    uncomment

commit bae2e322523efc73a1344464f51611e2dc555ccb
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:17:09 2018 -0600

    maybe fixes

commit 6cb4db05c9d6ceba3794096f0172cae5ed5f6019
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:57:37 2018 -0600

    we back

commit d97ab57fb32cb23371169d9ed659ccfac34cfe45
Merge: a117de4 b78aa8d
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:51:51 2018 -0600

    Merge remote-tracking branch 'upstream/master' into disown-tz-only-rebased2

commit b78aa8d
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:18:44 2018 -0500

    REF/TST: Add pytest idiom to reshape/test_tile (pandas-dev#24107)

commit 2993b8e
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:17:55 2018 -0500

    REF/TST: Add more pytest idiom to scalar/test_nat (pandas-dev#24120)

commit b841374
Author: evangelineliu <hsiyinliu@gmail.com>
Date:   Wed Dec 5 18:21:46 2018 -0500

    BUG: Fix concat series loss of timezone (pandas-dev#24027)

commit 4ae63aa
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:44:50 2018 -0800

    Implement DatetimeArray._from_sequence (pandas-dev#24074)

commit 2643721
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:43:45 2018 -0800

    CLN: Follow-up to pandas-dev#24100 (pandas-dev#24116)

commit 8ea7744
Author: chris-b1 <cbartak@gmail.com>
Date:   Wed Dec 5 14:21:23 2018 -0600

    PERF: ascii c string functions (pandas-dev#23981)

commit cb862e4
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 12:19:46 2018 -0800

    BUG: fix mutation of DTI backing Series/DataFrame (pandas-dev#24096)

commit aead29b
Author: topper-123 <contribute@tensortable.com>
Date:   Wed Dec 5 19:06:00 2018 +0000

    API: rename MultiIndex.labels to MultiIndex.codes (pandas-dev#23752)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Windows Windows OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2
5 participants