PERF: Fix performance regression in infer_dtype #30202

Merged
merged 11 commits into from
Dec 23, 2019

Conversation

groutr
Contributor

@groutr groutr commented Dec 11, 2019

Fixes major performance regression in infer_dtype introduced in
aaaac86

Fixes performance regression introduced in
aaaac86
@groutr
Contributor Author

groutr commented Dec 11, 2019

Profiling the following function (before/after):

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

def slow_infer_dtype():
    df = pd.DataFrame(np.ones((100000, 1000)))
    for col in df.columns:
        infer_dtype(df[col], skipna=True)
master:
300127297 function calls (300127289 primitive calls) in 67.103 seconds
infer_dtype_perf:
126297 function calls (126289 primitive calls) in 1.050 seconds
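
The two summary lines above have the shape of the header that Python's cProfile/pstats prints, so a comparable measurement can probably be reproduced with the standard library alone. The following is a minimal sketch under that assumption; `profile_call` is a hypothetical helper name and is not part of this PR.

```python
import cProfile
import pstats


def profile_call(func):
    """Run func() under cProfile and return the headline numbers.

    The three values correspond to the pstats summary line:
    "<total> function calls (<primitive> primitive calls) in <seconds> seconds".
    """
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    stats = pstats.Stats(profiler)
    return {
        "total_calls": stats.total_calls,
        "primitive_calls": stats.prim_calls,
        "seconds": stats.total_tt,
    }
```

Calling `profile_call(slow_infer_dtype)` once on master and once on the branch should give numbers comparable to the two summary lines, although absolute timings depend on the machine.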

@groutr groutr changed the title from "Move skipna check to after type casting of values." to "PERF: Fix performance regression in infer_dtype" Dec 11, 2019
Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the PR!

We should maybe add an asv benchmark to ensure we catch such regressions in the future.

doc/source/whatsnew/v1.0.0.rst (outdated, resolved)
@jorisvandenbossche jorisvandenbossche added the Performance (Memory or execution speed performance) label Dec 11, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Dec 11, 2019
Contributor

@jreback jreback left a comment

can you add an asv benchmark (put it in dtypes.py), ideally with a fair number of types that we can test?
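
For readers unfamiliar with asv, a parametrized benchmark along these lines might look like the sketch below. The class and method names follow the `dtypes.InferDtypes` output that appears later in this thread; the array size `N` and the exact contents of each entry are assumptions for illustration and are not necessarily what was merged.

```python
import numpy as np
from pandas.api.types import infer_dtype

# Assumed size for illustration; the thread does not state the real one.
N = 100_000


class InferDtypes:
    # asv runs every time_* method once per entry in `params`.
    param_names = ["dtype"]
    data_dict = {
        "np-object": np.array([1] * N, dtype="O"),
        "py-object": [1] * N,
        "np-null": np.array([1.0] * (N // 2) + [np.nan] * (N // 2)),
        "py-null": [1] * (N // 2) + [None] * (N // 2),
        "np-int": np.array([1] * N, dtype=int),
        "np-floating": np.array([1.0] * N, dtype=float),
        "empty": [],
        "bytes": [b"a"] * N,
    }
    params = list(data_dict.keys())

    def time_infer(self, dtype):
        # Baseline: nulls are not skipped.
        infer_dtype(self.data_dict[dtype], skipna=False)

    def time_infer_skipna(self, dtype):
        # The path affected by the regression discussed in this PR.
        infer_dtype(self.data_dict[dtype], skipna=True)
```

Building the inputs once at class level keeps data construction out of the timed methods, so each `time_*` call measures only `infer_dtype`.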

Co-Authored-By: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@groutr
Contributor Author

groutr commented Dec 11, 2019

I seem to be having some issues getting asv running. ASV consistently fails to compile the Cython extensions.

@jbrockmendel
Member

nice speedup!

@pep8speaks

pep8speaks commented Dec 11, 2019

Hello @groutr! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-17 00:16:18 UTC

@TomAugspurger
Contributor

@groutr one more linting issue:

black --version
black, version 19.10b0
Checking black formatting
would reformat /home/runner/work/pandas/pandas/asv_bench/benchmarks/dtypes.py
Oh no! 💥 💔 💥
1 file would be reformatted, 885 files would be left unchanged.

@TomAugspurger
Contributor

Were you able to get the benchmarks running? Roughly how long do your new ones take?

For testing you can do asv dev -b InferDtypes, which will reuse your same python / pandas install.

@groutr
Contributor Author

groutr commented Dec 12, 2019

@TomAugspurger oh, sorry, I didn't know we ran black on asv_bench too.
I was eventually able to get asv running.

[  0.00%] ·· Benchmarking existing-py_home_grout_miniconda3_envs_pandas-dev_bin_python
[ 25.00%] ··· dtypes.InferDtypes.time_infer                                                                                                                        ok
[ 25.00%] ··· ============= =========
                  dtype
              ------------- ---------
                np-object    318±0μs
                py-object    335±0μs
                 np-null     342±0μs
                 py-null     329±0μs
                  np-int     323±0μs
               np-floating   342±0μs
                  empty      329±0μs
                  bytes      321±0μs
              ============= =========

[ 50.00%] ··· dtypes.InferDtypes.time_infer_skipna                                                                                                                 ok
[ 50.00%] ··· ============= =========
                  dtype
              ------------- ---------
                np-object    410±0μs
                py-object    431±0μs
                 np-null     418±0μs
                 py-null     418±0μs
                  np-int     413±0μs
               np-floating   414±0μs
                  empty      415±0μs
                  bytes      415±0μs
              ============= =========

@TomAugspurger
Contributor

Ah, one more linting error @groutr, sorry :) You might want to set up pre-commit: https://dev.pandas.io/docs/development/contributing.html#python-pep8-black

@datapythonista I don't recall, but before moving CI checks to GitHub actions, did we run all the checks and only report on failures at the end? So if there are both black and isort issues, you'd get both reported, instead of just the first failure? Sorry if that's a known issue.

@datapythonista
Member

I'm quite sure that when we moved to GitHub Actions all checks were running even if there were failures, but that stopped working. I reported it to GitHub one or two weeks ago, but haven't heard from them yet.

@jreback
Contributor

jreback commented Dec 15, 2019

@groutr can you show the asv results for master vs this branch?

@groutr
Contributor Author

groutr commented Dec 16, 2019

@jreback
Running:

git checkout infer_dtype_perf
conda activate pandas-dev
cd asv_bench
asv continuous -f 1.1 master infer_dtype_perf -b dtypes.InferDtypes

Produces identical timings for both master and infer_dtype_perf.
I don't understand why, since my own timings show a major performance improvement.

EDIT: Please disregard this comment. I had written the benchmarking code incorrectly.

@groutr
Contributor Author

groutr commented Dec 17, 2019

(pandas-dev) ❯ asv continuous upstream/master infer_dtype_perf -b dtypes.InferDtypes
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing 903abc2f <infer_dtype_perf> into conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit 37dfcc1a <master> (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Running (dtypes.InferDtypes.time_infer--)..
[ 25.00%] · For pandas commit 903abc2f <infer_dtype_perf> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 37.50%] ··· Running (dtypes.InferDtypes.time_infer--)..
[ 50.00%] · For pandas commit 903abc2f <infer_dtype_perf> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· dtypes.InferDtypes.time_infer                                                                                                                        ok
[ 62.50%] ··· ============= =============
                  dtype                  
              ------------- -------------
                np-object     1.95±0.2ms 
                py-object    4.51±0.04ms 
                 np-null      4.04±0.1μs 
                 py-null     4.49±0.04ms 
                  np-int      3.97±0.2μs 
               np-floating    3.96±0.4μs 
                  empty      9.32±0.06μs 
                  bytes      4.16±0.03ms 
              ============= =============

[ 75.00%] ··· dtypes.InferDtypes.time_infer_skipna                                                                                                                 ok
[ 75.00%] ··· ============= =============
                  dtype                  
              ------------- -------------
                np-object    7.09±0.06ms 
                py-object     9.66±0.1ms 
                 np-null      3.95±0.1μs 
                 py-null     6.92±0.06ms 
                  np-int     3.87±0.04μs 
               np-floating   3.94±0.03μs 
                  empty       14.3±0.2μs
                  bytes      9.53±0.07ms
              ============= =============

[ 75.00%] · For pandas commit 37dfcc1a <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 87.50%] ··· dtypes.InferDtypes.time_infer                                                                                                                        ok
[ 87.50%] ··· ============= =============
                  dtype
              ------------- -------------
                np-object    1.92±0.05ms
                py-object    4.47±0.08ms
                 np-null      4.09±0.7μs
                 py-null      4.56±0.2ms
                  np-int      4.06±0.1μs
               np-floating    3.94±0.2μs
                  empty       9.34±0.3μs
                  bytes      4.22±0.06ms
              ============= =============

[100.00%] ··· dtypes.InferDtypes.time_infer_skipna                                                                                                                 ok
[100.00%] ··· ============= =============
                  dtype
              ------------- -------------
                np-object     7.24±0.3ms
                py-object     9.84±0.3ms
                 np-null      11.7±0.2ms
                 py-null     7.04±0.08ms
                  np-int      11.5±0.4ms
               np-floating    11.9±0.5ms
                  empty       14.3±0.2μs
                  bytes       9.62±0.1ms
              ============= =============

       before           after         ratio
     [37dfcc1a]       [903abc2f]
     <master>         <infer_dtype_perf>
-      11.7±0.2ms       3.95±0.1μs     0.00  dtypes.InferDtypes.time_infer_skipna('np-null')
-      11.5±0.4ms      3.87±0.04μs     0.00  dtypes.InferDtypes.time_infer_skipna('np-int')
-      11.9±0.5ms      3.94±0.03μs     0.00  dtypes.InferDtypes.time_infer_skipna('np-floating')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
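
A ratio of 0.00 in the comparison above is a rounding effect: asv reports after/before to two decimal places. A small sketch of the arithmetic, using the mean timings from the three changed rows (the helper name `summarize` is made up for this example):

```python
# Mean timings in seconds, copied from the before/after columns above.
rows = [
    ("np-null", 11.7e-3, 3.95e-6),
    ("np-int", 11.5e-3, 3.87e-6),
    ("np-floating", 11.9e-3, 3.94e-6),
]


def summarize(before, after):
    """Return asv's two-decimal ratio string and the speedup factor."""
    ratio = after / before    # what asv prints in the "ratio" column
    speedup = before / after  # how many times faster the branch is
    return f"{ratio:.2f}", speedup


for name, before, after in rows:
    ratio_text, speedup = summarize(before, after)
    print(f"{name}: ratio {ratio_text}, about {speedup:,.0f}x faster")
```

For all three rows the speedup works out to roughly 3,000-fold, which is why the ratio rounds to zero.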

Member

@WillAyd WillAyd left a comment

Nice speedup! Can you add a whatsnew note to 1.0 performance improvements?

@groutr
Contributor Author

groutr commented Dec 17, 2019

@WillAyd, I already did in ea579f7

Member

@WillAyd WillAyd left a comment

lgtm

@jreback jreback merged commit db022e2 into pandas-dev:master Dec 23, 2019
@jreback
Contributor

jreback commented Dec 23, 2019

thanks

AlexKirko pushed a commit to AlexKirko/pandas that referenced this pull request Dec 29, 2019
Labels
Performance (Memory or execution speed performance)
Development

Successfully merging this pull request may close these issues.

infer_dtype() function slower in latest version
8 participants