
bpo-45885: Specialize COMPARE_OP #29734

Merged (11 commits) on Dec 3, 2021
Conversation

@sweeneyde (Member) commented Nov 23, 2021


@markshannon (Member)

Your microbenchmark is only testing <. What happens if you have a mix of comparisons?

Python/ceval.c Outdated
PyObject *left = SECOND();
DEOPT_IF(!PyUnicode_CheckExact(left), COMPARE_OP);
DEOPT_IF(!PyUnicode_CheckExact(right), COMPARE_OP);
DEOPT_IF(!PyUnicode_IS_READY(left), COMPARE_OP);
Member

Is this necessary? If left == right then they are equal regardless of whether the string is ready or not.

Member Author

I expect that almost all strings are ready. I suppose the alternative is

            DEOPT_IF(!PyUnicode_CheckExact(left), COMPARE_OP);
            DEOPT_IF(!PyUnicode_CheckExact(right), COMPARE_OP);
            int res = 1;
            if (left != right) {
                DEOPT_IF(!PyUnicode_IS_READY(left), COMPARE_OP);
                DEOPT_IF(!PyUnicode_IS_READY(right), COMPARE_OP);
                res = compare_the_strings_assuming_theyre_ready(left, right);
            }

Is that what you mean?

Member

Yes. Assuming the identity shortcut is worthwhile, it makes sense to shortcut as much work as possible.
You are also missing out on the quick inequality test using lengths: len(a) != len(b) implies a != b.

Maybe wrap unicode_compare_eq in an equality function that readies the strings (I'm surprised that no such thing exists) and use that?

int res = 1;
if (left != right) {
    res = _PyUnicode_Equal(left, right);
}
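
Concretely, such a wrapper might look like the sketch below. It assumes only the standard CPython unicode macros (PyUnicode_READY, PyUnicode_GET_LENGTH, PyUnicode_KIND, PyUnicode_DATA); the function name is made up for the sketch, and this is not the code that was merged:

#include <Python.h>
#include <string.h>

/* Sketch only: ready both strings, then apply the identity and length
   shortcuts before falling back to a raw memory comparison. */
static int
unicode_equal_sketch(PyObject *left, PyObject *right)
{
    if (left == right) {
        return 1;  /* identity implies equality */
    }
    if (PyUnicode_READY(left) < 0 || PyUnicode_READY(right) < 0) {
        return -1;  /* error while readying a string */
    }
    Py_ssize_t len = PyUnicode_GET_LENGTH(left);
    if (len != PyUnicode_GET_LENGTH(right)) {
        return 0;  /* different lengths: cannot be equal */
    }
    if (PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
        return 0;  /* canonical strings with equal text share a kind */
    }
    return memcmp(PyUnicode_DATA(left), PyUnicode_DATA(right),
                  (size_t)len * PyUnicode_KIND(left)) == 0;
}

With a helper like this, the specialized instruction would only need the two PyUnicode_CheckExact deopt checks before calling it.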

@markshannon (Member)

There are a lot of branches in the int and float specializations, which is not good for performance.
https://github.com/python/cpython/blob/main/Python/adaptive.md#implementation-of-specialized-instructions

Because neither comparison can fail (ints have a total ordering, and floats have a total ordering plus a special not-equal-if-either-is-a-NaN result), we can avoid the branching by masking the result of a comparison with a code for the operator.
E.g. res = (float_compare() & OPERATOR_MASK) ? Py_True : Py_False

Also see my comment on the issue about avoiding the need to push and pop objects.

@markshannon self-assigned this Nov 24, 2021

@sweeneyde (Member Author)

Here's one idea for ints:

int64_t cmp = ...;  // elided: result of comparing the two ints
// same as sign = (cmp > 0) - (cmp < 0), given 64-bit arithmetic shifts
int sign = (int)((cmp >> 63) - ((-cmp) >> 63));
int flags = 1 << (sign + 1);
// mask: 0b001 is negative, 0b010 is zero, 0b100 is positive, etc.
int res = flags & mask;
next_instruction();
if (res) JUMP_BY(oparg);

Then the mask can be chosen depending on whether the following instruction is POP_JUMP_IF_TRUE or POP_JUMP_IF_FALSE.

There would still be a branch on whether Py_SIZE(left) - Py_SIZE(right) == 0, a loop over the digits, and a branch to check for Py_SIZE(left) < 0, but at least it would save the switch over oparg, the push/pop/incref/decref, and one DISPATCH().

Maybe the float equivalent would deopt if either arg is NaN and then do int sign = (dleft > dright) - (dleft < dright).
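
For illustration, a branch-free float version along those lines might look like the sketch below. The function name and CMP_* flags are assumptions for the sketch, not CPython identifiers, and the caller is assumed to have already deopted on NaN:

enum { CMP_LT = 1 << 0, CMP_EQ = 1 << 1, CMP_GT = 1 << 2 };

/* Sketch only: reduce the comparison to a sign in {-1, 0, 1}, convert
   it to a one-hot flag, and test it against a per-operator bit mask,
   with no branch on the operator itself. */
static int
float_compare_mask(double dleft, double dright, int mask)
{
    int sign = (dleft > dright) - (dleft < dright);  /* -1, 0 or 1 */
    int flags = 1 << (sign + 1);  /* CMP_LT, CMP_EQ or CMP_GT */
    return (flags & mask) != 0;
}

Under that encoding, < would pass CMP_LT as the mask, <= would pass CMP_LT | CMP_EQ, != would pass CMP_LT | CMP_GT, and the boolean result can feed the following POP_JUMP_IF_TRUE/FALSE directly.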

@sweeneyde (Member Author)

These three combined opcodes have slightly worse hit percentages on some benchmarks.

@sweeneyde (Member Author) commented Nov 28, 2021

Newer pyperformance numbers (note that my setup is not very stable):

Slower (20):

  • unpickle: 16.4 us +- 1.5 us -> 17.8 us +- 0.9 us: 1.08x slower
  • pickle_list: 4.28 us +- 0.09 us -> 4.63 us +- 0.13 us: 1.08x slower
  • scimark_sparse_mat_mult: 5.49 ms +- 0.21 ms -> 5.89 ms +- 0.16 ms: 1.07x slower
  • pickle: 12.3 us +- 0.9 us -> 13.0 us +- 1.1 us: 1.06x slower
  • regex_dna: 205 ms +- 3 ms -> 217 ms +- 5 ms: 1.06x slower
  • meteor_contest: 108 ms +- 3 ms -> 112 ms +- 2 ms: 1.04x slower
  • pickle_dict: 28.4 us +- 0.2 us -> 29.5 us +- 0.7 us: 1.04x slower
  • python_startup: 8.00 ms +- 0.06 ms -> 8.25 ms +- 0.50 ms: 1.03x slower
  • logging_silent: 114 ns +- 1 ns -> 118 ns +- 2 ns: 1.03x slower
  • regex_effbot: 3.16 ms +- 0.02 ms -> 3.25 ms +- 0.03 ms: 1.03x slower
  • nqueens: 95.2 ms +- 1.9 ms -> 97.7 ms +- 1.8 ms: 1.03x slower
  • unpickle_list: 4.97 us +- 0.11 us -> 5.10 us +- 0.10 us: 1.03x slower
  • nbody: 101 ms +- 5 ms -> 103 ms +- 3 ms: 1.03x slower
  • python_startup_no_site: 6.01 ms +- 0.04 ms -> 6.16 ms +- 0.27 ms: 1.03x slower
  • mako: 12.3 ms +- 0.1 ms -> 12.5 ms +- 0.1 ms: 1.02x slower
  • logging_simple: 7.51 us +- 0.13 us -> 7.67 us +- 0.16 us: 1.02x slower
  • spectral_norm: 114 ms +- 4 ms -> 116 ms +- 2 ms: 1.02x slower
  • scimark_fft: 378 ms +- 7 ms -> 385 ms +- 12 ms: 1.02x slower
  • tornado_http: 183 ms +- 7 ms -> 186 ms +- 6 ms: 1.02x slower
  • pathlib: 20.6 ms +- 0.3 ms -> 20.8 ms +- 0.4 ms: 1.01x slower

Faster (17):

  • scimark_sor: 153 ms +- 4 ms -> 142 ms +- 4 ms: 1.08x faster
  • unpack_sequence: 43.0 ns +- 1.0 ns -> 41.0 ns +- 0.8 ns: 1.05x faster
  • go: 166 ms +- 2 ms -> 159 ms +- 3 ms: 1.04x faster
  • django_template: 44.4 ms +- 1.8 ms -> 42.6 ms +- 2.5 ms: 1.04x faster
  • fannkuch: 433 ms +- 10 ms -> 419 ms +- 10 ms: 1.03x faster
  • pidigits: 183 ms +- 1 ms -> 178 ms +- 1 ms: 1.03x faster
  • scimark_monte_carlo: 78.5 ms +- 1.7 ms -> 76.5 ms +- 1.1 ms: 1.03x faster
  • chameleon: 7.69 ms +- 0.13 ms -> 7.50 ms +- 0.10 ms: 1.03x faster
  • telco: 6.96 ms +- 0.13 ms -> 6.83 ms +- 0.19 ms: 1.02x faster
  • hexiom: 7.49 ms +- 0.13 ms -> 7.35 ms +- 0.18 ms: 1.02x faster
  • pickle_pure_python: 394 us +- 6 us -> 387 us +- 9 us: 1.02x faster
  • deltablue: 4.98 ms +- 0.10 ms -> 4.90 ms +- 0.14 ms: 1.02x faster
  • scimark_lu: 118 ms +- 2 ms -> 116 ms +- 2 ms: 1.02x faster
  • 2to3: 282 ms +- 4 ms -> 278 ms +- 4 ms: 1.01x faster
  • float: 84.2 ms +- 1.2 ms -> 83.1 ms +- 1.6 ms: 1.01x faster
  • unpickle_pure_python: 298 us +- 7 us -> 294 us +- 6 us: 1.01x faster
  • pyflate: 550 ms +- 14 ms -> 545 ms +- 8 ms: 1.01x faster

Benchmark hidden because not significant (18): chaos, crypto_pyaes, dulwich_log, json_dumps, json_loads, logging_format, raytrace, regex_compile, regex_v8, richards, sympy_expand, sympy_integrate, sympy_sum, sympy_str, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x slower

@sweeneyde (Member Author)

Newest microbenchmarks: pretty good, even when including a mix of comparisons:

from pyperf import Runner
runner = Runner()

runner.timeit(
    "float_loop",
    setup="arr = [float(x) for x in range(5_000_000)]",
    stmt="for x in arr:\n"
         "    if x > 1e6 or x == 1e6 or x < 0.0:\n"
         "        break\n",
)

runner.timeit(
    "int_loop",
    setup="arr = list(range(5_000_000))",
    stmt="for i in arr:\n"
         "    if i > 6_000_000 or i == 6_000_000 or i < 0:\n"
         "        break\n",
)

runner.timeit(
    "str_loop",
    setup="arr = ['Py'] * 5_000_000",
    stmt="for word in arr:\n"
         "    if word == 'Python' or word != 'Py':\n"
         "        break\n",
)

"""
float_loop: Mean +- std dev: [micro_main] 41.5 ms +- 0.4 ms -> [micro_combined3] 23.7 ms +- 0.8 ms: 1.75x faster
int_loop: Mean +- std dev: [micro_main] 202 ms +- 2 ms -> [micro_combined3] 133 ms +- 1 ms: 1.52x faster
str_loop: Mean +- std dev: [micro_main] 140 ms +- 7 ms -> [micro_combined3] 88.1 ms +- 3.9 ms: 1.59x faster

Geometric mean: 1.62x faster
"""


@sweeneyde (Member Author)

Slower (19):

  • pickle_dict: 27.9 us +- 0.2 us -> 29.9 us +- 0.2 us: 1.07x slower
  • scimark_sparse_mat_mult: 5.28 ms +- 0.30 ms -> 5.63 ms +- 0.12 ms: 1.07x slower
  • nbody: 92.6 ms +- 2.6 ms -> 98.6 ms +- 2.1 ms: 1.06x slower
  • pickle_list: 4.28 us +- 0.11 us -> 4.51 us +- 0.11 us: 1.05x slower
  • regex_v8: 23.5 ms +- 0.5 ms -> 24.5 ms +- 0.4 ms: 1.04x slower
  • unpack_sequence: 41.3 ns +- 0.7 ns -> 42.9 ns +- 1.1 ns: 1.04x slower
  • sympy_str: 352 ms +- 12 ms -> 364 ms +- 6 ms: 1.03x slower
  • scimark_fft: 377 ms +- 10 ms -> 390 ms +- 7 ms: 1.03x slower
  • django_template: 44.4 ms +- 1.8 ms -> 45.7 ms +- 1.6 ms: 1.03x slower
  • json_dumps: 13.1 ms +- 0.3 ms -> 13.5 ms +- 0.3 ms: 1.03x slower
  • raytrace: 370 ms +- 13 ms -> 380 ms +- 12 ms: 1.03x slower
  • spectral_norm: 120 ms +- 3 ms -> 123 ms +- 2 ms: 1.03x slower
  • sympy_expand: 590 ms +- 21 ms -> 603 ms +- 15 ms: 1.02x slower
  • scimark_lu: 119 ms +- 5 ms -> 121 ms +- 4 ms: 1.02x slower
  • sympy_sum: 193 ms +- 3 ms -> 196 ms +- 2 ms: 1.02x slower
  • regex_effbot: 3.35 ms +- 0.03 ms -> 3.39 ms +- 0.04 ms: 1.01x slower
  • nqueens: 94.2 ms +- 2.4 ms -> 95.1 ms +- 1.6 ms: 1.01x slower
  • meteor_contest: 109 ms +- 2 ms -> 110 ms +- 2 ms: 1.01x slower
  • pathlib: 20.4 ms +- 0.4 ms -> 20.6 ms +- 0.3 ms: 1.01x slower

Faster (20):

  • scimark_sor: 153 ms +- 4 ms -> 136 ms +- 2 ms: 1.13x faster
  • pyflate: 554 ms +- 13 ms -> 518 ms +- 13 ms: 1.07x faster
  • scimark_monte_carlo: 78.8 ms +- 2.1 ms -> 75.3 ms +- 0.9 ms: 1.05x faster
  • deltablue: 5.06 ms +- 0.11 ms -> 4.86 ms +- 0.12 ms: 1.04x faster
  • fannkuch: 439 ms +- 6 ms -> 425 ms +- 19 ms: 1.03x faster
  • unpickle_pure_python: 292 us +- 4 us -> 284 us +- 5 us: 1.03x faster
  • json_loads: 28.8 us +- 2.0 us -> 28.1 us +- 0.8 us: 1.02x faster
  • unpickle_list: 4.98 us +- 0.11 us -> 4.87 us +- 0.12 us: 1.02x faster
  • mako: 12.6 ms +- 0.2 ms -> 12.4 ms +- 0.2 ms: 1.02x faster
  • pickle_pure_python: 392 us +- 13 us -> 385 us +- 9 us: 1.02x faster
  • 2to3: 288 ms +- 14 ms -> 284 ms +- 7 ms: 1.02x faster
  • xml_etree_iterparse: 112 ms +- 3 ms -> 110 ms +- 2 ms: 1.02x faster
  • crypto_pyaes: 92.5 ms +- 2.6 ms -> 91.1 ms +- 2.3 ms: 1.02x faster
  • go: 165 ms +- 3 ms -> 162 ms +- 2 ms: 1.01x faster
  • hexiom: 7.55 ms +- 0.17 ms -> 7.45 ms +- 0.14 ms: 1.01x faster
  • logging_simple: 7.67 us +- 0.16 us -> 7.59 us +- 0.14 us: 1.01x faster
  • dulwich_log: 83.4 ms +- 1.1 ms -> 82.6 ms +- 1.2 ms: 1.01x faster
  • python_startup: 8.16 ms +- 0.10 ms -> 8.09 ms +- 0.17 ms: 1.01x faster
  • logging_silent: 116 ns +- 2 ns -> 115 ns +- 2 ns: 1.01x faster
  • chameleon: 7.51 ms +- 0.10 ms -> 7.46 ms +- 0.09 ms: 1.01x faster

Benchmark hidden because not significant (16): chaos, float, logging_format, pickle, pidigits, python_startup_no_site, regex_compile, regex_dna, richards, sympy_integrate, telco, tornado_http, unpickle, xml_etree_parse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x slower

@markshannon (Member)

My results are less noisy and show a ~1% speedup.

Given that some of the largest improvements are from notoriously noisy benchmarks, 1% may be overstating the speedup but it does seem to be real.

Python/ceval.c
double dleft = PyFloat_AS_DOUBLE(left);
double dright = PyFloat_AS_DOUBLE(right);
int sign = (dleft > dright) - (dleft < dright);
DEOPT_IF(isnan(dleft), COMPARE_OP);
Member

Potentially we could use the fact that NaNs compare unequal to everything, including themselves, to add a fourth bit to the mask. Probably leave that for another PR, as the maths starts getting a bit convoluted.
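
For illustration, that fourth bit could work roughly like the sketch below (the flag names are assumptions, not CPython identifiers):

#include <math.h>

enum { CMP_LT = 1, CMP_EQ = 2, CMP_GT = 4, CMP_UNORDERED = 8 };

/* Sketch only: an unordered result (either operand is a NaN) gets its
   own flag instead of triggering a deopt; any mask without
   CMP_UNORDERED then evaluates false for NaNs, matching IEEE 754. */
static int
float_flags_with_nan(double a, double b)
{
    if (isnan(a) || isnan(b)) {
        return CMP_UNORDERED;
    }
    return 1 << (((a > b) - (a < b)) + 1);  /* CMP_LT, CMP_EQ or CMP_GT */
}

!= would then use the mask CMP_LT | CMP_GT | CMP_UNORDERED, == would use CMP_EQ alone, and NaN operands would no longer force a deopt.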

Python/specialize.c
goto success;
}
}
SPECIALIZATION_FAIL(COMPARE_OP, SPEC_FAIL_OTHER);
Member

I'd like more information here.
Is this a builtin class, or a Python class?
Does it override the comparison, or (in the case of ==/!=) does it rely on the default identity comparison?

Doesn't need to be in this PR, though.
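
For illustration, that extra classification might look roughly like the sketch below. SPECIALIZATION_FAIL and the type-object fields are real, but lhs and the SPEC_FAIL_* names are assumptions for the sketch:

/* Sketch only: classify the failure before recording it, using
   Py_TPFLAGS_HEAPTYPE as a rough proxy for "defined in Python". */
PyTypeObject *type = Py_TYPE(lhs);
if (!(type->tp_flags & Py_TPFLAGS_HEAPTYPE)) {
    SPECIALIZATION_FAIL(COMPARE_OP, SPEC_FAIL_BUILTIN_CLASS);       /* hypothetical */
}
else if (type->tp_richcompare == PyBaseObject_Type.tp_richcompare) {
    SPECIALIZATION_FAIL(COMPARE_OP, SPEC_FAIL_DEFAULT_IDENTITY);    /* hypothetical */
}
else {
    SPECIALIZATION_FAIL(COMPARE_OP, SPEC_FAIL_OVERRIDDEN_COMPARE);  /* hypothetical */
}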

@markshannon (Member)

Looks good in general. The approach seems sound and gives a speedup.

There are a couple of things that need to be tightened up, and a couple of possible enhancements (for future PRs).

@markshannon (Member)

Updated results are good.

@markshannon (Member)

👍

@markshannon merged commit 03768c4 into python:main Dec 3, 2021
@bedevere-bot commented Dec 3, 2021

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 Arch Linux Asan Debug 3.x has failed when building commit 03768c4.

What you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/585/builds/782) and take a look at the build logs.
  4. Check if the failure is related to this commit (03768c4) or if it is a false positive.
  5. If the failure is related to this commit, please reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/585/builds/782

Summary of the results of the build (if available):

Click to see traceback logs
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 21 (delta 10), reused 11 (delta 10), pack-reused 1
From https://github.com/python/cpython
 * branch                  main       -> FETCH_HEAD
Note: switching to '03768c4d139df46212a091ed931aad03bec18b57'.


HEAD is now at 03768c4d13 [bpo-45885](https://bugs.python.org/issue45885): Specialize COMPARE_OP (GH-29734)
Switched to and reset branch 'main'

renaming build/scripts-3.11/pydoc3 to build/scripts-3.11/pydoc3.11
renaming build/scripts-3.11/idle3 to build/scripts-3.11/idle3.11
renaming build/scripts-3.11/2to3 to build/scripts-3.11/2to3-3.11
make: *** [Makefile:1746: buildbottest] Terminated

Cannot open file '/buildbot/buildarea/3.x.pablogsal-arch-x86_64.asan_debug/build/test-results.xml' for upload

@sweeneyde deleted the compare branch December 3, 2021 21:31
@joaoe commented Feb 27, 2022

@sweeneyde
Hi. I just stumbled upon this PR.
Now in Objects/unicodeobject.c there are both a _PyUnicode_EQ and a _PyUnicode_Equal. Is that on purpose?
It looks a bit redundant, especially since _PyUnicode_EQ uses unicode_eq(), which does not do a pointer check before the more expensive memcmp. Meanwhile, your new function uses the local function unicode_compare_eq, which seems to be a duplicate of unicode_eq. An opportunity for some cleanup?
