
MSVC <bit> #795

Merged
merged 39 commits into microsoft:master
Jul 2, 2020

Conversation

barcharcraz
Member

@barcharcraz barcharcraz commented May 5, 2020

Enables the <bit> functions on MSVC.

  • _Countr_zero has moved to <limits> to support <numeric>
  • still using the static write-race method to detect instruction support (see the sketch after this list). I will push a revision to this PR using __isa_available once the internal PR that adds the needed bits to __isa_available is up
  • less sophisticated on ARM than on x86, since the relevant ARM instructions are NEON and can wait
  • does not omit runtime checks if /arch indicates the instruction is always available. This is because all three instructions I use are indicated by their own CPUID feature bits and are not necessarily present on all CPUs supporting a given /arch. I could do the historical research to figure out which /arch is high enough, but the benefit isn't huge (it's one branch that always goes the same way).
  • also adds a new set of test cases for 64-bit values where the top half is all ones and the bottom half all zeros, or vice versa.
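Below is a minimal sketch of the write-race detection pattern described in the list above, assuming a CPUID-based popcnt check; the names _Popcnt_available and _Checked_popcount are illustrative, not the actual <bit> implementation:

#include <intrin.h>

// -1 == unknown; the write race is benign because every writer stores the same value.
static int _Popcnt_available = -1;

int _Checked_popcount(unsigned int _Val) {
    int _Avail = _Popcnt_available;
    if (_Avail == -1) {
        int _Cpu_info[4];
        __cpuid(_Cpu_info, 1);
        _Avail = (_Cpu_info[2] & (1 << 23)) != 0; // CPUID leaf 1, ECX bit 23 == POPCNT
        _Popcnt_available = _Avail;
    }
    if (_Avail) {
        return static_cast<int>(__popcnt(_Val));
    }
    int _Count = 0; // portable fallback: clear the lowest set bit until none remain
    for (; _Val != 0; _Val &= _Val - 1) {
        ++_Count;
    }
    return _Count;
}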

@barcharcraz barcharcraz requested a review from a team as a code owner May 5, 2020 21:44
@StephanTLavavej StephanTLavavej added the cxx20 C++20 feature label May 5, 2020
@StephanTLavavej StephanTLavavej changed the title Msvc bit MSVC <bit> May 5, 2020
@pcordes

pcordes commented May 6, 2020

> does not omit runtime checks if /arch indicates the instruction is always available. This is because all three instructions I use are indicated by their own CPUID feature bits and are not necessarily present on all CPUs supporting a given /arch. I could do the historical research to figure out which /arch is high enough, but the benefit isn't huge (it's one branch that always goes the same way).

popcnt does have its own CPU feature flag, but SSE4.2 is assumed to imply popcnt. On Intel they were introduced at the same time (Nehalem). Every real CPU with SSE4.2 has popcnt, across all vendors.

In fact, GCC's -msse4.2 implies -mpopcnt, so doing the same thing in MSVC would not be surprising or weird https://godbolt.org/z/hspLqb. You can specify -mpopcnt separately to enable popcnt without SSE4.2 SIMD instructions, or disable it separately, like perhaps for kernel code where you want to allow popcnt but not SIMD instructions.

Later ISA extensions like AVX imply SSE4.2 and thus in practice popcnt, so any /arch:AVX* should also imply popcnt.

(Technically, I think Intel's documentation only specifies that the AVX1 feature flag implies that the VEX encodings of instructions like vpcmpestri are supported; at least that's the way Intel's ISA manual specifies it: https://www.felixcloutier.com/x86/pcmpestri. But in practice, software can and does assume that Intel CPU features are incremental, and that a newer one like AVX2 implies all older ones like AVX1 and SSE*.)

Some AMD-only features like XOP and FMA4 have come and gone, and AMD's SSE4a isn't implied by AVX on Intel CPUs. But Intel features are incremental and can be seen as "feature levels", not just feature bitmaps. And the coupling between SSE feature levels and bit manipulation is that some software already assumes that SSE4.2 implies popcnt.

I'm pretty sure any virtual machine / emulator that indicated SSE4.2 support via CPUID but faulted on popcnt would not be able to run some real-world software, so it's safe for compilers to assume that no such systems exist, or don't need to be supported (i.e. they're not real x86 systems).

For the record, I don't know if Intel documents anywhere that later CPUID feature bits imply earlier ones, especially not the SSE4.2 / popcnt thing. So we'd be relying on widely-established practice here, probably not explicit documentation.
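For illustration, a hedged sketch of what statically assuming popcnt under /arch:AVX and later could look like (the macro name _HAS_POPCNT_STATICALLY is hypothetical; MSVC does define __AVX__ and __AVX2__ under the corresponding /arch flags):

// Hypothetical: compile out the runtime CPUID check when /arch guarantees popcnt,
// per the reasoning above that AVX implies SSE4.2 and, in practice, popcnt.
#if defined(__AVX__)
#define _HAS_POPCNT_STATICALLY 1
#else
#define _HAS_POPCNT_STATICALLY 0
#endif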

@AlexGuteniev
Contributor

AlexGuteniev commented May 6, 2020

I think that for tzcnt / lzcnt, the feature check and handling of zero should be done after the counting instruction. This way it is possible to convert a potentially mispredicted jump into cmov*:

#include <intrin.h>

bool supported = false;

unsigned lzcnt_check_before(unsigned val) {
    if (!supported && val == 0) { // val == 0 check compiles to a conditional jump (JCC)
        return 32;
    }
    unsigned result = __lzcnt(val);
    return result;
}

unsigned lzcnt_check_after(unsigned val) {
    unsigned result = __lzcnt(val);
    if (!supported && val == 0) { // val == 0 check compiles to CMOV
        result = 32;
    }
    return result;
}

Or possibly even perform just the static check, with no runtime check, and have no conditional jumps at all:

unsigned lzcnt_check_after_unconditionally(unsigned val) {
    unsigned result = __lzcnt(val);
#ifndef LZCNT_AVAILABLE
    if (val == 0) {
        result = 32;
    }
#endif
    return result;
}

https://godbolt.org/z/dwaySr

@pcordes

pcordes commented May 6, 2020

> I think that for tzcnt / lzcnt, the feature check and handling of zero should be done after the counting instruction. This way it is possible to convert a potentially mispredicted jump into cmov*:

Note that lzcnt = 31 - bsr (i.e. bsr ^ 31): on CPUs without LZCNT, the lzcnt encoding executes as plain BSR and returns the bit index rather than the count, so this idea can only work for tzcnt (whose encoding executes as BSF, which already produces the trailing-zero count for nonzero inputs).

Interesting idea, but it's not clear whether it's a good idea to put a cmov on the critical-path dependency chain through TZCNT. It could be worse depending on the use case, especially pre-Broadwell, where cmov is 2 uops. Looking at code-gen and microbenchmarking some actual use cases would be advisable.

Ideally you'd like the compiler to hoist the supported check out of a loop, at least for small loops where possible (loop inversion: making 2 versions of the loop instead of branching inside the loop). So that's another thing to look for in some test cases.

And if value-range optimizations can prove that the input value is never 0, you want the check to optimize away and just use rep bsf. That should be able to happen with either way of doing the check, but again, it's something to verify when picking an implementation for real use with some compiler version.
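To make the lzcnt caveat concrete, here is a minimal sketch of a BSR-based fallback (the name lzcnt_fallback is illustrative); because BSR returns the index of the highest set bit, the result must be flipped, which is why a single cmov can't patch up __lzcnt on CPUs where it executes as BSR:

#include <intrin.h>

unsigned lzcnt_fallback(unsigned val) {
    unsigned long idx;
    if (!_BitScanReverse(&idx, val)) { // returns 0 for val == 0, leaving idx undefined
        return 32;
    }
    return 31 - idx; // convert the MSB index into a leading-zero count
}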

@AlexGuteniev
Contributor

> Ideally you'd like the compiler to hoist the supported check out of a loop, at least for small loops where possible (loop inversion: making 2 versions of the loop instead of branching inside the loop). So that's another thing to look for in some test cases.

Apparently MSVC is not doing this for bit counting in a loop; all it does is cache supported in a register. But I imagine that some compiler does, and if a compiler ever does, the check-before version is easier for it.
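Here is a hand-written sketch of the loop inversion being discussed (helper and variable names are illustrative, reusing the supported flag from the earlier examples):

#include <cstddef>
#include <intrin.h>

extern bool supported; // runtime TZCNT availability, as in the earlier examples

unsigned sum_tzcnt(const unsigned* vals, std::size_t n) {
    unsigned total = 0;
    if (supported) { // feature check hoisted: one predictable branch for the whole loop
        for (std::size_t i = 0; i < n; ++i) {
            total += _tzcnt_u32(vals[i]);
        }
    } else {
        for (std::size_t i = 0; i < n; ++i) {
            unsigned long idx; // BSF-based fallback, no TZCNT needed
            total += _BitScanForward(&idx, vals[i]) ? idx : 32;
        }
    }
    return total;
}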

@barcharcraz
Member Author

> popcnt does have its own CPU feature flag, but SSE4.2 is assumed to imply popcnt. On Intel they were introduced at the same time (Nehalem). Every real CPU with SSE4.2 has popcnt, across all vendors. […]

Unfortunately we can't do that for lzcnt / tzcnt, since Ivy Bridge has SSE4.2 but not lzcnt / tzcnt.
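For reference, a sketch of the separate CPUID bits that gate these two instructions (a robust version would first verify the maximum supported leaf; the bit positions are from Intel's documentation):

#include <intrin.h>

bool has_lzcnt() {
    int regs[4];
    __cpuid(regs, 0x80000001); // extended leaf 80000001H
    return (regs[2] & (1 << 5)) != 0; // ECX bit 5: ABM / LZCNT
}

bool has_tzcnt() {
    int regs[4];
    __cpuidex(regs, 7, 0); // leaf 7, subleaf 0
    return (regs[1] & (1 << 3)) != 0; // EBX bit 3: BMI1, which includes TZCNT
}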

@barcharcraz
Member Author

barcharcraz commented Jun 18, 2020

> I think that for tzcnt / lzcnt, the feature check and handling of zero should be done after the counting instruction. This way it is possible to convert a potentially mispredicted jump into cmov*: […]

I'm now checking for tzcnt and lzcnt with __ISA_AVAILABLE_AVX2.

I'm not going to flip the condition, because I just don't want to deal with it, and we'd need better benchmark tests to make sure it's worth it. As for static checking, that's something I'd like, but it can wait a bit. Dynamic checking is required because most programs are not compiled with non-default /arch flags, and I still want those to work.
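A minimal sketch of the dynamic check described above, assuming the CRT's __isa_available machinery (the helper name _Has_tzcnt_lzcnt is illustrative, and the exact declarations may differ from the PR's code):

#include <isa_availability.h>

extern "C" int __isa_available; // initialized by the CRT at startup

// __ISA_AVAILABLE_AVX2 acts as a proxy: AVX2-era CPUs also have BMI1/BMI2,
// and therefore tzcnt and lzcnt.
inline bool _Has_tzcnt_lzcnt() noexcept {
    return __isa_available >= __ISA_AVAILABLE_AVX2;
}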

@barcharcraz barcharcraz reopened this Jun 18, 2020
Charles added 2 commits June 17, 2020 20:44
update bit for msvc (more updates)

fixup arm countr and popcount

small comment about intended widening.
@StephanTLavavej StephanTLavavej left a comment
Member


Looks perfect!

@CaseyCarter CaseyCarter merged commit 5e3423a into microsoft:master Jul 2, 2020
@CaseyCarter
Member

Thanks for your contribution, even if it did only involve a <bit> of work!

Labels
cxx20 C++20 feature
Development

Successfully merging this pull request may close these issues.

<bit>: Activate __cpp_lib_bitops for MSVC
8 participants