
MSVC <bit> #795

Merged
merged 39 commits into microsoft:master
Jul 2, 2020

Conversation

barcharcraz
Member

@barcharcraz barcharcraz commented May 5, 2020

Enables the <bit> functions on MSVC.

  • _Countr_zero has moved to <limits> to support <numeric>
  • still using the static write-race method to detect instruction support (see the sketch after this list). I will push a revision to this PR using __isa_available once the internal PR that adds the needed bits to __isa_available is up
  • less sophisticated on ARM than on x86, since the relevant ARM instructions are NEON and can wait
  • does not omit runtime checks if /arch indicates the instruction is always available. This is because all three instructions I use are indicated by their own CPUID feature bits and are not necessarily present on all CPUs supporting a given /arch. I could do the historical research to figure out which /arch is high enough, but the benefit isn't huge (it's one branch that always goes the same way).
  • also adds a new set of test cases for 64-bit values where the top half is all ones and the bottom half all zeros, or vice versa.
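Below is a minimal sketch of the write-race detection pattern described in the list above, assuming a CPUID-based popcnt check; the names _Popcnt_available and _Checked_popcount are illustrative, not the actual <bit> implementation:

#include <intrin.h>

// -1 == unknown; the write race is benign because every writer stores the same value.
static int _Popcnt_available = -1;

int _Checked_popcount(unsigned int _Val) {
    int _Avail = _Popcnt_available;
    if (_Avail == -1) {
        int _Cpu_info[4];
        __cpuid(_Cpu_info, 1);
        _Avail = (_Cpu_info[2] & (1 << 23)) != 0; // CPUID leaf 1, ECX bit 23 == POPCNT
        _Popcnt_available = _Avail;
    }
    if (_Avail) {
        return static_cast<int>(__popcnt(_Val));
    }
    int _Count = 0; // portable fallback: clear the lowest set bit until none remain
    for (; _Val != 0; _Val &= _Val - 1) {
        ++_Count;
    }
    return _Count;
}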

@barcharcraz barcharcraz requested a review from a team as a code owner May 5, 2020 21:44
@StephanTLavavej StephanTLavavej added the cxx20 C++20 feature label May 5, 2020
@StephanTLavavej StephanTLavavej changed the title Msvc bit MSVC <bit> May 5, 2020
@pcordes

pcordes commented May 6, 2020

> does not omit runtime checks if /arch indicates the instruction is always available. This is because all three instructions I use are indicated by their own CPUID feature bits and are not necessarily present on all CPUs supporting a given /arch. I could do the historical research to figure out which /arch is high enough, but the benefit isn't huge (it's one branch that always goes the same way).

popcnt does have its own CPU feature flag, but SSE4.2 is assumed to imply popcnt. On Intel they were introduced at the same time (Nehalem). Every real CPU with SSE4.2 has popcnt, across all vendors.

In fact, GCC's -msse4.2 implies -mpopcnt, so doing the same thing in MSVC would not be surprising or weird https://godbolt.org/z/hspLqb. You can specify -mpopcnt separately to enable popcnt without SSE4.2 SIMD instructions, or disable it separately, like perhaps for kernel code where you want to allow popcnt but not SIMD instructions.

Later ISA extensions like AVX imply SSE4.2 and thus in practice popcnt, so any /arch:AVX* should also imply popcnt.

(Technically, I think Intel's documentation only specifies that the AVX1 feature flag implies that the VEX encodings of instructions like vpcmpestri are supported; at least that's the way Intel's ISA manual specifies it: https://www.felixcloutier.com/x86/pcmpestri. But in practice, software can and does assume that Intel CPU features are incremental, and that a newer one like AVX2 implies all older ones like AVX1 and SSE*.)

Some AMD-only features like XOP and FMA4 have come and gone, and AMD's SSE4a isn't implied by AVX on Intel CPUs. But Intel features are incremental and can be seen as "feature levels", not just feature bitmaps. And the coupling between SSE feature levels and bit manipulation is that some software already assumes that SSE4.2 implies popcnt.

I'm pretty sure any virtual machine / emulator that indicated SSE4.2 support via CPUID but faulted on popcnt would not be able to run some real-world software, so it's safe for compilers to assume that no such systems exist, or don't need to be supported (i.e. they're not real x86 systems).

For the record, I don't know if Intel documents anywhere that later CPUID feature bits imply earlier ones, especially not the SSE4.2 / popcnt thing. So we'd be relying on widely-established practice here, probably not explicit documentation.
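For illustration, a hedged sketch of what statically assuming popcnt under /arch:AVX and later could look like (the macro name _HAS_POPCNT_STATICALLY is hypothetical; MSVC does define __AVX__ and __AVX2__ under the corresponding /arch flags):

// Hypothetical: compile out the runtime CPUID check when /arch guarantees popcnt,
// per the reasoning above that AVX implies SSE4.2 and, in practice, popcnt.
#if defined(__AVX__)
#define _HAS_POPCNT_STATICALLY 1
#else
#define _HAS_POPCNT_STATICALLY 0
#endif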

@AlexGuteniev
Contributor

AlexGuteniev commented May 6, 2020

I think that for tzcnt / lzcnt, the feature check and handling of zero should be done after the counting instruction. This way it is possible to convert a potentially mispredicted jump into cmov*:

#include <intrin.h>

bool supported = false;

unsigned lzcnt_check_before(unsigned val) {
    if (!supported && val == 0) { // val == 0 check compiles to a conditional jump (JCC)
        return 32;
    }
    unsigned result = __lzcnt(val);
    return result;
}

unsigned lzcnt_check_after(unsigned val) {
    unsigned result = __lzcnt(val);
    if (!supported && val == 0) { // val == 0 check compiles to CMOV
        result = 32;
    }
    return result;
}

Or possibly even perform just the static check, with no runtime check, and have no conditional jumps at all:

unsigned lzcnt_check_after_unconditionally(unsigned val) {
    unsigned result = __lzcnt(val);
#ifndef LZCNT_AVAILABLE
    if (val == 0) {
        result = 32;
    }
#endif
    return result;
}

https://godbolt.org/z/dwaySr

@pcordes

pcordes commented May 6, 2020

> I think that for tzcnt / lzcnt, the feature check and handling of zero should be done after the counting instruction. This way it is possible to convert a potentially mispredicted jump into cmov*:

Note that lzcnt = 31 - bsr (i.e. bsr ^ 31): on CPUs without LZCNT, the lzcnt encoding executes as plain BSR and returns the bit index rather than the count, so this idea can only work for tzcnt (whose encoding executes as BSF, which already produces the trailing-zero count for nonzero inputs).

Interesting idea, but it's not clear whether it's a good idea to put a cmov on the critical-path dependency chain through TZCNT. It could be worse depending on the use case, especially pre-Broadwell, where cmov is 2 uops. Looking at code-gen and microbenchmarking some actual use cases would be advisable.

Ideally you'd like the compiler to hoist the supported check out of a loop, at least for small loops where possible (loop inversion: making 2 versions of the loop instead of branching inside the loop). So that's another thing to look for in some test cases.

And if value-range optimizations can prove that the input value is never 0, you want the check to optimize away and just use rep bsf. That should be able to happen with either way of doing the check, but again, it's something to verify when picking an implementation for real use with some compiler version.
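To make the lzcnt caveat concrete, here is a minimal sketch of a BSR-based fallback (the name lzcnt_fallback is illustrative); because BSR returns the index of the highest set bit, the result must be flipped, which is why a single cmov can't patch up __lzcnt on CPUs where it executes as BSR:

#include <intrin.h>

unsigned lzcnt_fallback(unsigned val) {
    unsigned long idx;
    if (!_BitScanReverse(&idx, val)) { // returns 0 for val == 0, leaving idx undefined
        return 32;
    }
    return 31 - idx; // convert the MSB index into a leading-zero count
}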

@AlexGuteniev
Contributor

> Ideally you'd like the compiler to hoist the supported check out of a loop, at least for small loops where possible (loop inversion: making 2 versions of the loop instead of branching inside the loop). So that's another thing to look for in some test cases.

Apparently MSVC is not doing this for bit counting in a loop; all it does is cache supported in a register. But I imagine that some compiler does, and if a compiler ever does, the check-before version is easier for it.
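Here is a hand-written sketch of the loop inversion being discussed (helper and variable names are illustrative, reusing the supported flag from the earlier examples):

#include <cstddef>
#include <intrin.h>

extern bool supported; // runtime TZCNT availability, as in the earlier examples

unsigned sum_tzcnt(const unsigned* vals, std::size_t n) {
    unsigned total = 0;
    if (supported) { // feature check hoisted: one predictable branch for the whole loop
        for (std::size_t i = 0; i < n; ++i) {
            total += _tzcnt_u32(vals[i]);
        }
    } else {
        for (std::size_t i = 0; i < n; ++i) {
            unsigned long idx; // BSF-based fallback, no TZCNT needed
            total += _BitScanForward(&idx, vals[i]) ? idx : 32;
        }
    }
    return total;
}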

@barcharcraz
Member Author

> popcnt does have its own CPU feature flag, but SSE4.2 is assumed to imply popcnt. On Intel they were introduced at the same time (Nehalem). Every real CPU with SSE4.2 has popcnt, across all vendors. […]

Unfortunately we can't do that for lzcnt / tzcnt, since Ivy Bridge has SSE4.2 but not lzcnt / tzcnt.
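For reference, a sketch of the separate CPUID bits that gate these two instructions (a robust version would first verify the maximum supported leaf; the bit positions are from Intel's documentation):

#include <intrin.h>

bool has_lzcnt() {
    int regs[4];
    __cpuid(regs, 0x80000001); // extended leaf 80000001H
    return (regs[2] & (1 << 5)) != 0; // ECX bit 5: ABM / LZCNT
}

bool has_tzcnt() {
    int regs[4];
    __cpuidex(regs, 7, 0); // leaf 7, subleaf 0
    return (regs[1] & (1 << 3)) != 0; // EBX bit 3: BMI1, which includes TZCNT
}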

@barcharcraz
Member Author

barcharcraz commented Jun 18, 2020

> I think that for tzcnt / lzcnt, the feature check and handling of zero should be done after the counting instruction. This way it is possible to convert a potentially mispredicted jump into cmov*: […]

I'm now checking for tzcnt and lzcnt with __ISA_AVAILABLE_AVX2.

I'm not going to flip the condition, because I just don't want to deal with it, and we'd need better benchmark tests to make sure it's worth it. As for static checking, that's something I'd like, but it can wait a bit. Dynamic checking is required because most programs are not compiled with non-default /arch flags, and I still want those to work.
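A minimal sketch of the dynamic check described above, assuming the CRT's __isa_available machinery (the helper name _Has_tzcnt_lzcnt is illustrative, and the exact declarations may differ from the PR's code):

#include <isa_availability.h>

extern "C" int __isa_available; // initialized by the CRT at startup

// __ISA_AVAILABLE_AVX2 acts as a proxy: AVX2-era CPUs also have BMI1/BMI2,
// and therefore tzcnt and lzcnt.
inline bool _Has_tzcnt_lzcnt() noexcept {
    return __isa_available >= __ISA_AVAILABLE_AVX2;
}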

@barcharcraz barcharcraz reopened this Jun 18, 2020
Charles added 2 commits June 17, 2020 20:44
update bit for msvc (more updates)

fixup arm countr and popcount

small comment about intended widening.
@StephanTLavavej StephanTLavavej left a comment
Member


Looks perfect!

@CaseyCarter CaseyCarter merged commit 5e3423a into microsoft:master Jul 2, 2020
@CaseyCarter
Member

Thanks for your contribution, even if it did only involve a <bit> of work!

Labels
cxx20 C++20 feature
Development

Successfully merging this pull request may close these issues.

<bit>: Activate __cpp_lib_bitops for MSVC
8 participants