PERF: For GH23814, return early in Categorical.__init__ #23888

Merged
merged 15 commits into from
Nov 30, 2018

Conversation

eoveson
Contributor

@eoveson eoveson commented Nov 24, 2018

@pep8speaks

Hello @eoveson! Thanks for submitting the PR.

@eoveson eoveson changed the title from "Category perf" to "PERF: For GH23814, return early in Categorical.__init__" Nov 24, 2018
@codecov

codecov bot commented Nov 24, 2018

Codecov Report

Merging #23888 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23888      +/-   ##
==========================================
+ Coverage   92.29%    92.3%   +0.01%     
==========================================
  Files         161      161              
  Lines       51498    51556      +58     
==========================================
+ Hits        47530    47590      +60     
+ Misses       3968     3966       -2
Flag Coverage Δ
#multiple 90.7% <100%> (+0.01%) ⬆️
#single 42.43% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.4% <100%> (+0.04%) ⬆️
pandas/core/arrays/timedeltas.py 95.95% <0%> (-0.49%) ⬇️
pandas/plotting/_misc.py 38.68% <0%> (-0.31%) ⬇️
pandas/core/indexes/base.py 96.32% <0%> (-0.17%) ⬇️
pandas/core/arrays/datetimes.py 98.37% <0%> (-0.14%) ⬇️
pandas/tseries/offsets.py 96.84% <0%> (-0.14%) ⬇️
pandas/core/ops.py 94.14% <0%> (-0.14%) ⬇️
pandas/core/config.py 87.04% <0%> (-0.13%) ⬇️
pandas/io/sas/sas_xport.py 90.14% <0%> (-0.1%) ⬇️
pandas/io/formats/printing.py 93.01% <0%> (-0.08%) ⬇️
... and 49 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d865e52...9e270e9. Read the comment docs.

@gfyoung gfyoung added the Performance (Memory or execution speed performance) and Categorical (Categorical Data Type) labels Nov 25, 2018
@gfyoung
Member

gfyoung commented Nov 25, 2018

@eoveson : Thanks for the PR! Can you run asv to check performance benchmarks?

@@ -314,6 +314,16 @@ class Categorical(ExtensionArray, PandasObject):
    def __init__(self, values, categories=None, ordered=None, dtype=None,
                 fastpath=False):

        # GH23814, for perf, if no optional params used and values already an
Contributor

I think we can just move this down to where the fastpath check is now; you can add it on there, I think. This constructor is already far too complicated.

Contributor Author

I think at that point, the arg dtype, and maybe categories, will be set. I wanted to only use this early return if none of the optional args were specified (I believe @TomAugsperger was suggesting this in the issue thread).
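
The condition described in this comment can be sketched in isolation. The following is a minimal illustration, not the pandas implementation: `ToyCategorical`, its attributes, and the `took_fast_path` flag are all hypothetical, and the sketch only shows the idea of returning early when the input is already categorical and none of the optional arguments were given.

```python
class ToyCategorical:
    """Hypothetical stand-in for a categorical container (not pandas code)."""

    def __init__(self, values, categories=None, ordered=None):
        # Fast path: the input is already a ToyCategorical and the caller
        # passed no optional arguments, so reuse its codes and categories
        # instead of re-deriving them from the raw values.
        if (isinstance(values, ToyCategorical)
                and categories is None and ordered is None):
            self.codes = values.codes
            self.categories = values.categories
            self.ordered = values.ordered
            self.took_fast_path = True
            return

        # Slow path: materialise the values and factorize them.
        if isinstance(values, ToyCategorical):
            values = [values.categories[c] for c in values.codes]
        if categories is None:
            categories = sorted(set(values))
        lookup = {cat: i for i, cat in enumerate(categories)}
        self.codes = [lookup[v] for v in values]
        self.categories = list(categories)
        self.ordered = bool(ordered)
        self.took_fast_path = False


first = ToyCategorical(list("abcd") * 3)
second = ToyCategorical(first)                          # no optional args
third = ToyCategorical(first, categories=list("dcba"))  # optional arg given
```

Here `second` takes the early return, while `third` falls through to the full path because `categories` was supplied, which is the distinction the comment is drawing.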

Contributor

@eoveson I would still like to investigate consolidating some of this code. This is a very complicated constructor, and more code is not great here. See if you can add it lower down, even if it's slightly lower perf.

Contributor Author

@jreback , Ok let me look into this and see if I can consolidate some of the code..

Contributor Author

Updated, please check it out when you get a chance

@eoveson
Contributor Author

eoveson commented Nov 25, 2018

@gfyoung -- yes, ran a subset of the asv suite (tried to target categorical), I can run the entire suite also. It reported no significant difference (maybe because there was no existing test for this scenario? -- which is why I added the new perf test)
(asv continuous -f 1.1 upstream/master category-perf -b ^categorical)

@gfyoung
Member

gfyoung commented Nov 25, 2018

(maybe because there was no existing test for this scenario? -- which is why I added the new perf test)

The test output should list all performance tests that were run. If it's not there, create a new branch off master with just the performance test added, and compare the two branches.

@eoveson
Contributor Author

eoveson commented Nov 27, 2018

The test output should list all performance tests that were run. If it's not there, create a new branch off master with just the performance test added, and compare the two branches.

@gfyoung, Yes, I saw the test I added show in the output when I ran the command I mentioned. Should I run all of the asv tests (tried running all asv tests, but it failed when it was about 1/3 done with a file access error for a temporary file), or should I target categorical tests?

Btw, this is the error I saw when trying to run all the asv tests:

[ 32.67%] ▒▒▒ Running (index_object.Indexing.time_get_loc--).
Traceback (most recent call last):
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\erikov\AppData\Local\Continuum\anaconda3\scripts\asv.exe\__main__.py", line 9, in <module>
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\main.py", line 38, in main
    result = args.func(args)
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\commands\__init__.py", line 49, in run_from_args
    return cls.run_from_conf_args(conf, args)
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\commands\continuous.py", line 72,in run_from_conf_args
    launch_method=args.launch_method, **kwargs
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\commands\continuous.py", line 106, in run
    _returns=run_objs, _machine_file=_machine_file)
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\commands\run.py", line 406, in run
    launch_method=launch_method)
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\runner.py", line 349, in run_benchmarks
    cwd=cache_dir)
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\runner.py", line 515, in run_benchmark
    cwd=cwd)
  File "c:\users\erikov\appdata\local\continuum\anaconda3\lib\site-packages\asv\runner.py", line 647, in _run_benchmark_single_param
    os.remove(result_file.name)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\erikov\\AppData\\Local\\Temp\\tmpfq5htpg5'

@gfyoung
Member

gfyoung commented Nov 27, 2018

Yes, I saw the test I added show in the output when I ran the command I mentioned. Should I run all of the asv tests (tried running all asv tests, but it failed when it was about 1/3 done with a file access error for a temporary file), or should I target categorical tests?

Running the Categorical tests is fine. I'm concerned though...you didn't see any noticeable improvement in performance, even with your newly added test?

@eoveson
Contributor Author

eoveson commented Nov 27, 2018

Well, this change only helps in the case that they have passed in an existing instance of Categorical to Categorical.__init__, and used no optional params. Not sure how common that would be in the tests? I didn't see that test case in the file I added the test case to. But I'm also not exactly sure how this asv test suite works. How does it get a baseline to compare against (since machine specs are different)? Am I supposed to create a baseline on my machine without my changes, and then run with my changes? If so, I didn't do that. I simply ran the asv command I mentioned, so I'm not sure if I'm doing things correctly..

@gfyoung
Member

gfyoung commented Nov 27, 2018

Well, this change only helps in the case that they have passed in an existing instance of Categorical to Categorical.__init__, and used no optional params.

Right, but didn't you say you saw no substantial changes in performance?

I simply ran the asv command I mentioned, so I'm not sure if I'm doing things correctly..

Can you copy / paste the output of your ASV?

@eoveson
Contributor Author

eoveson commented Nov 27, 2018

I see now that you specifically mentioned the test I added. So I should have seen a difference for that test, so I guess I need to run with that new test, but without my real changes to __init__ to create the baseline first?

@gfyoung
Member

gfyoung commented Nov 27, 2018

so I guess I need to run with that new test, but without my real changes to __init__ to create the baseline first?

Exactly. That's why I said earlier:

create a new branch off master with just the performance test added, and compare the two branches.

@eoveson
Contributor Author

eoveson commented Nov 27, 2018

Exactly. That's why I said earlier:

create a new branch off master with just the performance test added, and compare the two branches.

Ah, makes sense, thanks. I'll compare the two branches and get back to you (I'll first work on the code consolidation requested by jreback since that may impact things).

Contributor

@jreback jreback left a comment

tiny doc comment. ping when pushed.

@@ -1150,7 +1150,7 @@ Performance Improvements
- Improved performance of :func:`pd.concat` for `Series` objects (:issue:`23404`)
- Improved performance of :meth:`DatetimeIndex.normalize` and :meth:`Timestamp.normalize` for timezone naive or UTC datetimes (:issue:`23634`)
- Improved performance of :meth:`DatetimeIndex.tz_localize` and various ``DatetimeIndex`` attributes with dateutil UTC timezone (:issue:`23772`)

- Improved performance of :meth:`Categorical.__init__` (:issue:`23814`)
Contributor

say constructor rather than referring to __init__

Contributor Author

@jreback , updated doc string, and also added asv test that exercises the code (first one didn't, but left it since still useful) (you can see my comment about asv results to gfyoung)

@jreback jreback added this to the 0.24.0 milestone Nov 29, 2018
@eoveson
Contributor Author

eoveson commented Nov 29, 2018

@gfyoung , it turns out that asv test I added previously was not exercising the code (I should have been passing in a Series rather than Categorical to the constructor). I added a new asv test for this (but left the other one since it could still be useful). I re-ran asv, and did see a significant difference reported in that newly added test and one other test. I didn't expect that other test to change, so I re-ran the same command and looking at the numbers that test doesn't change much. (However, the reporting no longer says my newly added test shows significant difference in the second run of the command, even though I do see the same difference from the first run of the command). So I think things are ok now, but pasting the output here so you can take a look.

Here is the first execution of the command (and then down below you will see the second one):

$ asv continuous -f 1.1 upstream/master category-perf -b categorical
▒ Creating environments
▒ Discovering benchmarks
▒▒ Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
▒▒ Building 9e270e9 for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
▒▒ Installing 9e270e9 into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
▒ Running 66 total benchmarks (2 commits * 1 environments * 33 benchmarks)
[ 0.00%] ▒ For pandas commit 3e01c38 <master^2> (round 1/2):
[ 0.00%] ▒▒ Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 0.76%] ▒▒▒ Running (algorithms.Hashing.time_series_categorical--)...
[ 3.03%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_list_like--)..
[ 4.55%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_slice--)..
[ 6.06%] ▒▒▒ Running (categoricals.Concat.time_union--).................
[ 18.94%] ▒▒▒ Running (categoricals.Rank.time_rank_int--).........
[ 25.00%] ▒ For pandas commit 9e270e9 (round 1/2):
[ 25.00%] ▒▒ Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 25.76%] ▒▒▒ Running (algorithms.Hashing.time_series_categorical--)..
[ 27.27%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_list--).
[ 28.03%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_list_like--)..
[ 29.55%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_slice--)...
[ 31.82%] ▒▒▒ Running (categoricals.Constructor.time_all_nan--)................
[ 43.94%] ▒▒▒ Running (categoricals.Rank.time_rank_int--).........
[ 50.00%] ▒ For pandas commit 9e270e9 (round 2/2):
[ 50.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 50.76%] ▒▒▒ algorithms.Hashing.time_series_categorical 15.6▒8ms
[ 51.52%] ▒▒▒ ...s.CategoricalSlicing.time_getitem_bool_array ok
[ 51.52%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 2.60▒3ms
monotonic_decr 3.91▒0ms
non_monotonic 15.6▒0ms
================ ==========

[ 52.27%] ▒▒▒ ...oricals.CategoricalSlicing.time_getitem_list ok
[ 52.27%] ▒▒▒ ================ ============
index
---------------- ------------
monotonic_incr 679~0us
monotonic_decr 1.12▒1ms
non_monotonic 0▒600000ns
================ ============

[ 53.03%] ▒▒▒ ...ls.CategoricalSlicing.time_getitem_list_like ok
[ 53.03%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 12.5~5us
monotonic_decr 14.1~0us
non_monotonic 14.1~0us
================ ==========

[ 53.79%] ▒▒▒ ...icals.CategoricalSlicing.time_getitem_scalar ok
[ 53.79%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 5.05~2us
monotonic_decr 5.73~0us
non_monotonic 5.04~0us
================ ==========

[ 54.55%] ▒▒▒ ...ricals.CategoricalSlicing.time_getitem_slice ok
[ 54.55%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 3.93~4us
monotonic_decr 4.27~4us
non_monotonic 8.45~0us
================ ==========

[ 55.30%] ▒▒▒ categoricals.Concat.time_concat 7.81▒0ms
[ 56.06%] ▒▒▒ categoricals.Concat.time_union 15.6▒8ms
[ 56.82%] ▒▒▒ categoricals.Constructor.time_all_nan 31.2▒0ms
[ 57.58%] ▒▒▒ categoricals.Constructor.time_datetimes 1.30▒0.7ms
[ 58.33%] ▒▒▒ ...goricals.Constructor.time_datetimes_with_nat 1.30▒0ms
[ 59.09%] ▒▒▒ ...ricals.Constructor.time_existing_categorical 3.12▒1ms
[ 59.85%] ▒▒▒ categoricals.Constructor.time_existing_series 42.2~20us
[ 60.61%] ▒▒▒ categoricals.Constructor.time_fastpath 460~200us
[ 61.36%] ▒▒▒ ...oricals.Constructor.time_from_codes_all_int8 347~200us
[ 62.12%] ▒▒▒ categoricals.Constructor.time_regular 46.9▒6ms
[ 62.88%] ▒▒▒ categoricals.Constructor.time_with_nan 156▒0ms
[ 63.64%] ▒▒▒ categoricals.Contains.time_categorical_contains 78.9~0us
[ 64.39%] ▒▒▒ ...als.Contains.time_categorical_index_contains 3.46~0us
[ 65.15%] ▒▒▒ ...me_categorical_index_is_monotonic_decreasing 450▒0ns
[ 65.91%] ▒▒▒ ...me_categorical_index_is_monotonic_increasing 457▒200ns
[ 66.67%] ▒▒▒ ...e_categorical_series_is_monotonic_decreasing 51.4~0us
[ 67.42%] ▒▒▒ ...e_categorical_series_is_monotonic_increasing 62.5~20us
[ 68.18%] ▒▒▒ categoricals.Isin.time_isin_categorical ok
[ 68.18%] ▒▒▒ ======== ==========
dtype
-------- ----------
object 15.6▒0ms
int64 15.6▒6ms
======== ==========

[ 68.94%] ▒▒▒ categoricals.Rank.time_rank_int 11.7▒4ms
[ 69.70%] ▒▒▒ categoricals.Rank.time_rank_int_cat 7.81▒0ms
[ 70.45%] ▒▒▒ categoricals.Rank.time_rank_int_cat_ordered 0▒8000000ns
[ 71.21%] ▒▒▒ categoricals.Rank.time_rank_string 172▒10ms
[ 71.97%] ▒▒▒ categoricals.Rank.time_rank_string_cat 15.6▒0ms
[ 72.73%] ▒▒▒ categoricals.Rank.time_rank_string_cat_ordered 15.6▒6ms
[ 73.48%] ▒▒▒ categoricals.Repr.time_rendering 744~0us
[ 74.24%] ▒▒▒ categoricals.SetCategories.time_set_categories 31.2▒8ms
[ 75.00%] ▒▒▒ categoricals.ValueCounts.time_value_counts ok
[ 75.00%] ▒▒▒ ======== ==========
dropna
-------- ----------
True 15.6▒0ms
False 15.6▒0ms
======== ==========

[ 75.00%] ▒ For pandas commit 3e01c38 <master^2> (round 2/2):
[ 75.00%] ▒▒ Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 75.76%] ▒▒▒ algorithms.Hashing.time_series_categorical 7.81▒3ms
[ 76.52%] ▒▒▒ ...s.CategoricalSlicing.time_getitem_bool_array ok
[ 76.52%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 5.21▒2ms
monotonic_decr 3.91▒1ms
non_monotonic 7.81▒4ms
================ ==========

[ 77.27%] ▒▒▒ ...oricals.CategoricalSlicing.time_getitem_list ok
[ 77.27%] ▒▒▒ ================ ===========
index
---------------- -----------
monotonic_incr 625~200us
monotonic_decr 601~300us
non_monotonic 539~200us
================ ===========

[ 78.03%] ▒▒▒ ...ls.CategoricalSlicing.time_getitem_list_like ok
[ 78.03%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 12.5~0us
monotonic_decr 6.42~6us
non_monotonic 11.4~0us
================ ==========

[ 78.79%] ▒▒▒ ...icals.CategoricalSlicing.time_getitem_scalar ok
[ 78.79%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 0▒0ns
monotonic_decr 6.27~0us
non_monotonic 4.73~0us
================ ==========

[ 79.55%] ▒▒▒ ...ricals.CategoricalSlicing.time_getitem_slice ok
[ 79.55%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 7.82~0us
monotonic_decr 0▒6000ns
non_monotonic 8.52~0us
================ ==========

[ 80.30%] ▒▒▒ categoricals.Concat.time_concat 7.81▒0ms
[ 81.06%] ▒▒▒ categoricals.Concat.time_union 7.81▒0ms
[ 81.82%] ▒▒▒ categoricals.Constructor.time_all_nan 31.2▒0ms
[ 82.58%] ▒▒▒ categoricals.Constructor.time_datetimes 1.42▒0ms
[ 83.33%] ▒▒▒ ...goricals.Constructor.time_datetimes_with_nat 1.30▒0ms
[ 84.09%] ▒▒▒ ...ricals.Constructor.time_existing_categorical 2.60▒0ms
[ 84.85%] ▒▒▒ categoricals.Constructor.time_existing_series 3.12▒0ms
[ 85.61%] ▒▒▒ categoricals.Constructor.time_fastpath 460~0us
[ 86.36%] ▒▒▒ ...oricals.Constructor.time_from_codes_all_int8 434~200us
[ 87.12%] ▒▒▒ categoricals.Constructor.time_regular 46.9▒10ms
[ 87.88%] ▒▒▒ categoricals.Constructor.time_with_nan 148▒10ms
[ 88.64%] ▒▒▒ categoricals.Contains.time_categorical_contains 0▒40000ns
[ 89.39%] ▒▒▒ ...als.Contains.time_categorical_index_contains 2.84~0us
[ 90.15%] ▒▒▒ ...me_categorical_index_is_monotonic_decreasing 312▒0ns
[ 90.91%] ▒▒▒ ...me_categorical_index_is_monotonic_increasing 312▒0ns
[ 91.67%] ▒▒▒ ...e_categorical_series_is_monotonic_decreasing 47.3~0us
[ 92.42%] ▒▒▒ ...e_categorical_series_is_monotonic_increasing 56.8~0us
[ 93.18%] ▒▒▒ categoricals.Isin.time_isin_categorical ok
[ 93.18%] ▒▒▒ ======== ==========
dtype
-------- ----------
object 15.6▒0ms
int64 15.6▒8ms
======== ==========

[ 93.94%] ▒▒▒ categoricals.Rank.time_rank_int 7.81▒4ms
[ 94.70%] ▒▒▒ categoricals.Rank.time_rank_int_cat 7.81▒4ms
[ 95.45%] ▒▒▒ categoricals.Rank.time_rank_int_cat_ordered 7.81▒4ms
[ 96.21%] ▒▒▒ categoricals.Rank.time_rank_string 156▒10ms
[ 96.97%] ▒▒▒ categoricals.Rank.time_rank_string_cat 15.6▒0ms
[ 97.73%] ▒▒▒ categoricals.Rank.time_rank_string_cat_ordered 7.81▒0ms
[ 98.48%] ▒▒▒ categoricals.Repr.time_rendering 710~0us
[ 99.24%] ▒▒▒ categoricals.SetCategories.time_set_categories 31.2▒6ms
[100.00%] ▒▒▒ categoricals.ValueCounts.time_value_counts ok
[100.00%] ▒▒▒ ======== ==========
dropna
-------- ----------
True 15.6▒0ms
False 15.6▒0ms
======== ==========

       before           after         ratio
     [3e01c38]       [9e270e9]
     <master^2>
+       0~40000ns        78.9~0us      n/a  categoricals.Contains.time_categorical_contains
-        3.12~0ms       42.2~20us     0.01  categoricals.Constructor.time_existing_series

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

$ asv continuous -f 1.1 upstream/master category-perf -b categorical
▒ Creating environments
▒ Discovering benchmarks
▒▒ Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
▒▒ Installing 9e270e9 into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
▒ Running 66 total benchmarks (2 commits * 1 environments * 33 benchmarks)
[ 0.00%] ▒ For pandas commit 3e01c38 <master^2> (round 1/2):
[ 0.00%] ▒▒ Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 0.76%] ▒▒▒ Running (algorithms.Hashing.time_series_categorical--)...
[ 3.03%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_list_like--)..
[ 4.55%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_slice--).....
[ 8.33%] ▒▒▒ Running (categoricals.Constructor.time_datetimes_with_nat--)..............
[ 18.94%] ▒▒▒ Running (categoricals.Rank.time_rank_int--).........
[ 25.00%] ▒ For pandas commit 9e270e9 (round 1/2):
[ 25.00%] ▒▒ Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 25.76%] ▒▒▒ Running (algorithms.Hashing.time_series_categorical--)...
[ 28.03%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_list_like--)..
[ 29.55%] ▒▒▒ Running (categoricals.CategoricalSlicing.time_getitem_slice--).....
[ 33.33%] ▒▒▒ Running (categoricals.Constructor.time_datetimes_with_nat--)..............
[ 43.94%] ▒▒▒ Running (categoricals.Rank.time_rank_int--).........
[ 50.00%] ▒ For pandas commit 9e270e9 (round 2/2):
[ 50.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 50.76%] ▒▒▒ algorithms.Hashing.time_series_categorical 7.81▒0ms
[ 51.52%] ▒▒▒ ...s.CategoricalSlicing.time_getitem_bool_array ok
[ 51.52%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 3.91▒0ms
monotonic_decr 3.91▒0ms
non_monotonic 7.81▒0ms
================ ==========

[ 52.27%] ▒▒▒ ...oricals.CategoricalSlicing.time_getitem_list ok
[ 52.27%] ▒▒▒ ================ ===========
index
---------------- -----------
monotonic_incr 601~300us
monotonic_decr 558~0us
non_monotonic 521~0us
================ ===========

[ 53.03%] ▒▒▒ ...ls.CategoricalSlicing.time_getitem_list_like ok
[ 53.03%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 6.41~6us
monotonic_decr 11.7~0us
non_monotonic 12.8~6us
================ ==========

[ 53.79%] ▒▒▒ ...icals.CategoricalSlicing.time_getitem_scalar ok
[ 53.79%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 4.51~0us
monotonic_decr 4.31~2us
non_monotonic 4.55~2us
================ ==========

[ 54.55%] ▒▒▒ ...ricals.CategoricalSlicing.time_getitem_slice ok
[ 54.55%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 7.16~3us
monotonic_decr 7.15~0us
non_monotonic 7.69~0us
================ ==========

[ 55.30%] ▒▒▒ categoricals.Concat.time_concat 7.81▒0ms
[ 56.06%] ▒▒▒ categoricals.Concat.time_union 7.81▒0ms
[ 56.82%] ▒▒▒ categoricals.Constructor.time_all_nan 31.2▒0ms
[ 57.58%] ▒▒▒ categoricals.Constructor.time_datetimes 1.20▒0.5ms
[ 58.33%] ▒▒▒ ...goricals.Constructor.time_datetimes_with_nat 1.30▒0ms
[ 59.09%] ▒▒▒ ...ricals.Constructor.time_existing_categorical 3.12▒0ms
[ 59.85%] ▒▒▒ categoricals.Constructor.time_existing_series 42.2~0us
[ 60.61%] ▒▒▒ categoricals.Constructor.time_fastpath 355~0us
[ 61.36%] ▒▒▒ ...oricals.Constructor.time_from_codes_all_int8 446~200us
[ 62.12%] ▒▒▒ categoricals.Constructor.time_regular 46.9▒6ms
[ 62.88%] ▒▒▒ categoricals.Constructor.time_with_nan 141▒0ms
[ 63.64%] ▒▒▒ categoricals.Contains.time_categorical_contains 63.3~0us
[ 64.39%] ▒▒▒ ...als.Contains.time_categorical_index_contains 2.72~0us
[ 65.15%] ▒▒▒ ...me_categorical_index_is_monotonic_decreasing 383▒100ns
[ 65.91%] ▒▒▒ ...me_categorical_index_is_monotonic_increasing 323▒0ns
[ 66.67%] ▒▒▒ ...e_categorical_series_is_monotonic_decreasing 51.1~20us
[ 67.42%] ▒▒▒ ...e_categorical_series_is_monotonic_increasing 46.6~0us
[ 68.18%] ▒▒▒ categoricals.Isin.time_isin_categorical ok
[ 68.18%] ▒▒▒ ======== ==========
dtype
-------- ----------
object 15.6▒0ms
int64 15.6▒0ms
======== ==========

[ 68.94%] ▒▒▒ categoricals.Rank.time_rank_int 7.81▒3ms
[ 69.70%] ▒▒▒ categoricals.Rank.time_rank_int_cat 7.81▒3ms
[ 70.45%] ▒▒▒ categoricals.Rank.time_rank_int_cat_ordered 7.81▒0ms
[ 71.21%] ▒▒▒ categoricals.Rank.time_rank_string 141▒8ms
[ 71.97%] ▒▒▒ categoricals.Rank.time_rank_string_cat 15.6▒0ms
[ 72.73%] ▒▒▒ categoricals.Rank.time_rank_string_cat_ordered 7.81▒0ms
[ 73.48%] ▒▒▒ categoricals.Repr.time_rendering 710~0us
[ 74.24%] ▒▒▒ categoricals.SetCategories.time_set_categories 15.6▒6ms
[ 75.00%] ▒▒▒ categoricals.ValueCounts.time_value_counts ok
[ 75.00%] ▒▒▒ ======== ==========
dropna
-------- ----------
True 15.6▒0ms
False 15.6▒0ms
======== ==========

[ 75.00%] ▒ For pandas commit 3e01c38 <master^2> (round 2/2):
[ 75.00%] ▒▒ Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ▒▒ Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.76%] ▒▒▒ Setting up algorithms.py:83 ok
[ 75.76%] ▒▒▒ algorithms.Hashing.time_series_categorical 7.81▒0ms
[ 76.52%] ▒▒▒ ...s.CategoricalSlicing.time_getitem_bool_array ok
[ 76.52%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 5.21▒2ms
monotonic_decr 3.91▒0ms
non_monotonic 7.81▒0ms
================ ==========

[ 77.27%] ▒▒▒ ...oricals.CategoricalSlicing.time_getitem_list ok
[ 77.27%] ▒▒▒ ================ ===========
index
---------------- -----------
monotonic_incr 601~200us
monotonic_decr 539~0us
non_monotonic 539~200us
================ ===========

[ 78.03%] ▒▒▒ ...ls.CategoricalSlicing.time_getitem_list_like ok
[ 78.03%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 11.4~0us
monotonic_decr 12.5~5us
non_monotonic 12.7~0us
================ ==========

[ 78.79%] ▒▒▒ ...icals.CategoricalSlicing.time_getitem_scalar ok
[ 78.79%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 4.37~0us
monotonic_decr 4.70~2us
non_monotonic 4.70~0us
================ ==========

[ 79.55%] ▒▒▒ ...ricals.CategoricalSlicing.time_getitem_slice ok
[ 79.55%] ▒▒▒ ================ ==========
index
---------------- ----------
monotonic_incr 7.23~0us
monotonic_decr 7.10~0us
non_monotonic 7.84~0us
================ ==========

[ 80.30%] ▒▒▒ categoricals.Concat.time_concat 7.81▒0ms
[ 81.06%] ▒▒▒ categoricals.Concat.time_union 7.81▒0ms
[ 81.82%] ▒▒▒ categoricals.Constructor.time_all_nan 31.2▒6ms
[ 82.58%] ▒▒▒ categoricals.Constructor.time_datetimes 1.30▒0.5ms
[ 83.33%] ▒▒▒ ...goricals.Constructor.time_datetimes_with_nat 1.30▒0ms
[ 84.09%] ▒▒▒ ...ricals.Constructor.time_existing_categorical 3.12▒0ms
[ 84.85%] ▒▒▒ categoricals.Constructor.time_existing_series 3.12▒2ms
[ 85.61%] ▒▒▒ categoricals.Constructor.time_fastpath 411~200us
[ 86.36%] ▒▒▒ ...oricals.Constructor.time_from_codes_all_int8 347~0us
[ 87.12%] ▒▒▒ categoricals.Constructor.time_regular 46.9▒8ms
[ 87.88%] ▒▒▒ categoricals.Constructor.time_with_nan 141▒0ms
[ 88.64%] ▒▒▒ categoricals.Contains.time_categorical_contains 71.7~0us
[ 89.39%] ▒▒▒ ...als.Contains.time_categorical_index_contains 2.78~1us
[ 90.15%] ▒▒▒ ...me_categorical_index_is_monotonic_decreasing 316▒100ns
[ 90.91%] ▒▒▒ ...me_categorical_index_is_monotonic_increasing 343▒100ns
[ 91.67%] ▒▒▒ ...e_categorical_series_is_monotonic_decreasing 52.1~0us
[ 92.42%] ▒▒▒ ...e_categorical_series_is_monotonic_increasing 46.8~20us
[ 93.18%] ▒▒▒ categoricals.Isin.time_isin_categorical ok
[ 93.18%] ▒▒▒ ======== ==========
dtype
-------- ----------
object 15.6▒0ms
int64 15.6▒6ms
======== ==========

[ 93.94%] ▒▒▒ categoricals.Rank.time_rank_int 7.81▒3ms
[ 94.70%] ▒▒▒ categoricals.Rank.time_rank_int_cat 7.81▒4ms
[ 95.45%] ▒▒▒ categoricals.Rank.time_rank_int_cat_ordered 7.81▒0ms
[ 96.21%] ▒▒▒ categoricals.Rank.time_rank_string 141▒8ms
[ 96.97%] ▒▒▒ categoricals.Rank.time_rank_string_cat 15.6▒0ms
[ 97.73%] ▒▒▒ categoricals.Rank.time_rank_string_cat_ordered 7.81▒3ms
[ 98.48%] ▒▒▒ categoricals.Repr.time_rendering 679~300us
[ 99.24%] ▒▒▒ categoricals.SetCategories.time_set_categories 31.2▒0ms
[100.00%] ▒▒▒ categoricals.ValueCounts.time_value_counts ok
[100.00%] ▒▒▒ ======== ==========
dropna
-------- ----------
True 15.6▒0ms
False 15.6▒0ms
======== ==========

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
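
The 0.01 ratio reported for categoricals.Constructor.time_existing_series in the first run follows from the two timings. A small check, assuming (as a simplification) that the ratio is after divided by before and that the `-f 1.1` factor flags ratios above 1.1 or below 1/1.1; the `classify` helper is hypothetical, and asv's own decision may involve more than this threshold, which the sketch does not model.

```python
def classify(before_s, after_s, factor=1.1):
    """Return (ratio, verdict) for a before/after timing pair in seconds."""
    ratio = after_s / before_s
    if ratio > factor:
        verdict = "slower"
    elif ratio < 1 / factor:
        verdict = "faster"
    else:
        verdict = "unchanged"
    return ratio, verdict


# Timings reported for categoricals.Constructor.time_existing_series:
# 3.12 ms on master, 42.2 us on the PR branch.
ratio, verdict = classify(3.12e-3, 42.2e-6)
print(f"{ratio:.2f} {verdict}")  # prints "0.01 faster"
```

The same two timings appear in both runs, so the ratio itself is unchanged even though the second run reported no significant change.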

@jreback
Contributor

jreback commented Nov 29, 2018

@eoveson can you show a before / after using timeit in ipython

@eoveson
Contributor Author

eoveson commented Nov 29, 2018

@eoveson can you show a before / after using timeit in ipython

For sure. Before my change:

In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')

In [3]: %timeit s == 'a'
25.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.29 ms ± 70.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

After change:

In [6]: s = pd.Series(list('abcd') * 1000000).astype('category')

In [7]: %timeit s == 'a'
5.24 ms ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.28 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
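
The second timing in each pair compares integer codes instead of category values, which makes it a useful reference point for `s == 'a'`. The equivalence of the two comparisons can be sketched without pandas; the `encode` helper below is hypothetical and only mimics the codes/categories split of a categorical.

```python
def encode(values):
    """Split values into (codes, categories), like a categorical does."""
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position[v] for v in values]
    return codes, categories


values = list("abcd") * 5
codes, categories = encode(values)

# Comparing every value against the scalar ...
mask_values = [v == "a" for v in values]

# ... matches looking the scalar up once and comparing integer codes.
target = categories.index("a")
mask_codes = [c == target for c in codes]

assert mask_values == mask_codes
```

In the timings above, `s == 'a'` drops from 25.7 ms to 5.24 ms, while the codes comparison stays at about 3.3 ms in both sessions.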

@jreback jreback merged commit bdeddb1 into pandas-dev:master Nov 30, 2018
@jreback
Contributor

jreback commented Nov 30, 2018

thanks @eoveson

Successfully merging this pull request may close these issues.

equality comparison with a scalar is slow for category (performance regression)
5 participants