
addition/subtraction acceleration via inline assembly #314

Open
twiby wants to merge 3 commits into master

Conversation

@twiby commented Jul 8, 2024

During the addition/subtraction loops, there is performance to be gained by specifying the exact assembly instructions. The performance gain starts to be felt above 1,000-bit integers and is maximal around 10,000 bits, where we get close to 2x performance.

Documentation for assembly instructions can be found here

Added assembly

Principle

The main idea is to avoid any instruction that modifies the carry flag, except of course adc and sbb, so that the carry is propagated through the whole loop. This is doable by using a decreasing index as the loop counter together with a jnz instruction. The indices we use have to be updated with inc and dec instructions (which do not modify the carry flag).
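As a rough illustration of the principle only (a hedged sketch, not the PR's code; the function name is made up), a carry-preserving loop in Rust inline assembly could look like this: adc is the only CF-modifying instruction in the loop body, the counter uses inc/dec, and jnz closes the loop.

use std::arch::asm;

/// Sketch of a carry-preserving add loop, one digit per iteration.
/// Safety: `lhs` and `rhs` must have the same length.
unsafe fn carrying_add_assign(lhs: &mut [u64], rhs: &[u64]) -> bool {
    debug_assert_eq!(lhs.len(), rhs.len());
    if lhs.is_empty() {
        return false;
    }
    let carry: u8;
    asm!(
        "clc",                                  // start with CF = 0
        "2:",
        "mov {tmp}, qword ptr [{rhs} + 8*{i}]",
        "adc qword ptr [{lhs} + 8*{i}], {tmp}", // only instruction touching CF
        "inc {i}",                              // inc/dec leave CF untouched
        "dec {len}",
        "jnz 2b",
        "setc {carry}",                         // extract the final carry
        lhs = in(reg) lhs.as_mut_ptr(),
        rhs = in(reg) rhs.as_ptr(),
        i = inout(reg) 0usize => _,
        len = inout(reg) lhs.len() => _,
        tmp = out(reg) _,
        carry = out(reg_byte) carry,
        options(nostack),
    );
    carry != 0
}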

Batching

To avoid the overhead of looping over the digits one by one, we fill the 16 registers as much as possible. This lets us process the digits 5 at a time in each step, enabling efficient CPU pipelining.

A consequence of this is that some digits (strictly fewer than 5) remain unprocessed after the inline assembly has run, so we keep the existing code to handle the remaining digits, as sketched below.
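For illustration, the overall flow could look like the sketch below (the batched helper here is a hypothetical stub, not the PR's assembly routine): run the batched part, then let a plain carrying-add loop absorb the leftover digits and the pending carry.

/// Hypothetical batched fast path, stubbed out for the sketch: it would
/// report the pending carry and how many digits it actually processed.
unsafe fn add_batched(_lhs: &mut [u64], _rhs: &[u64]) -> (bool, usize) {
    (false, 0)
}

/// Sketch of the driver: batched (assembly) part first, then the existing
/// portable code handles the strictly-fewer-than-5 remaining digits.
fn add_assign_digits(lhs: &mut [u64], rhs: &[u64]) -> bool {
    let len = rhs.len();
    // The real code guarantees lhs has at least rhs.len() digits.
    let (mut carry, done) = unsafe { add_batched(&mut lhs[..len], rhs) };
    for (a, &b) in lhs[done..len].iter_mut().zip(&rhs[done..]) {
        let (sum, c1) = a.overflowing_add(b);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        *a = sum;
        carry = c1 || c2;
    }
    carry
}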

Safety

Obviously this is very much unsafe. In my personal project, where I want to give people the option of relying only on safe Rust, I put these functions behind a feature gate. If necessary, I can take a similar approach here.

Performance

I've measured performance using the criterion crate, adding two BigUints of various sizes. The results are below. On some scales, this gives close to a 2x gain! To me the performance gain is clear enough to be unambiguous, though I'm not a benchmarking expert and a finer evaluation could certainly be interesting.

[Figure 1: criterion benchmark results for BigUint addition at various sizes]

Style

For easy review, the 2 added functions are in the addition and subtraction files, but I'd argue that in the end it would be better to put them in a dedicated sub-module, since they look far too different from the surrounding code.

Also I'm notoriously average at naming things, so suggestions are welcome ^^

@cuviper (Member) commented Jul 8, 2024

Can you share your benchmark code somewhere? I'd like to compare results on various Intel and AMD machines, and also against -Ctarget-cpu=native. I'm not sure how much the CPU will matter when we're already using the adc/sbb intrinsics, but it's worth checking.

In general, I would rather see Rust-native changes that coax LLVM toward the codegen we want, if that's possible, because that's more likely to play well with target-cpu and even other architectures.

Also, I'm pretty sure this needs more cfg to avoid breaking x32 -- see the recent #311 and #312.

@twiby (Author) commented Jul 9, 2024

native rust concern

I've tried playing around in godbolt, but I could not generate the assembly I want, which is quite brittle: as soon as any instruction touches the carry flag (add or sub for the loop indices, or even cmp), you have to add instructions to store the carry flag and then conditionally set it back before the adc calls (a true 50% branch condition). I'll try again to see if I have more luck.

x32

Thanks, I did not know target_arch = "x86_64" wasn't enough! I don't have a 32-bit machine, but I'll try to fix those anyway. This whole trick should also be applied to 32-bit machines, with almost exactly the same assembly, but since I don't have a 32-bit machine I won't risk it.

benchmark code

My benchmark code is below; it is not particularly fancy:

use num_bigint::BigUint;

use rand::distributions::Standard;
use rand::prelude::Distribution;
use rand::{thread_rng, Rng};

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn gen_n_random_values<T>(n: usize) -> Vec<T>
where
    Standard: Distribution<T>,
{
    let mut ret = Vec::<T>::with_capacity(n);
    for _ in 0..n {
        ret.push(thread_rng().gen::<T>());
    }
    ret
}

pub fn add<const N: usize>(c: &mut Criterion) {
    let mut name = "num-bigint add ".to_string();
    name.push_str(&N.to_string());

    println!("nb digits: {:?}", (N / 32) * 32);
    let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 + &n2)));
}

criterion_group!(
    num_bigint_add,
    add<10>,
    add<30>,
    add<100>,
    add<300>,
    add<1_000>,
    add<3_000>,
    add<10_000>,
    add<30_000>,
    add<100_000>,
    add<300_000>,
    add<1_000_000>,
    add<3_000_000>,
    add<10_000_000>,
);

pub fn sub<const N: usize>(c: &mut Criterion) {
    let mut name = "num-bigint sub ".to_string();
    name.push_str(&N.to_string());

    let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32 + 1));
    let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 - &n2)));
}

criterion_group!(
    biguint_sub,
    sub<1_000>,
    sub<3_000>,
    sub<10_000>,
    sub<30_000>,
    sub<100_000>,
    sub<300_000>,
    sub<1_000_000>,
    sub<3_000_000>,
    sub<10_000_000>,
);

pub fn mul<const N: usize>(c: &mut Criterion) {
    let mut name = "num-bigint mul ".to_string();
    name.push_str(&N.to_string());

    println!("nb digits: {:?}", (N / 32) * 32);
    let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 * &n2)));
}

criterion_group!(
    biguint_mul,
    mul<10>,
    mul<30>,
    mul<100>,
    mul<300>,
    mul<1_000>,
    mul<3_000>,
    mul<10_000>,
    mul<30_000>,
);

criterion_main!(num_bigint_add, biguint_sub, biguint_mul);

@twiby (Author) commented Jul 9, 2024

I've pushed the fix for 32-bit pointer-width targets.
I was able to cross-compile for the i686-pc-windows-msvc target, but could not run the tests.
I was able to compile and test on my Raspberry Pi (ARM target).

@cuviper (Member) commented Jul 10, 2024

By x32, I mean the odd target x86_64-unknown-linux-gnux32, which uses a 64-bit CPU with 32-bit pointers. In Rust that looks like cfg(all(target_arch="x86_64", target_pointer_width="32")). See https://en.wikipedia.org/wiki/X32_ABI
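In cfg terms (an illustrative sketch only, not this PR's exact gating), the 64-bit assembly path has to require the 64-bit pointer width as well, so that x32 falls back to the portable code:

// Illustrative only: the u64-digit assembly path must not be selected on
// x32, where pointers are 32 bits wide despite the x86_64 architecture.
#[cfg(all(target_arch = "x86_64", target_pointer_width = "64"))]
const ASM_FAST_PATH: bool = true;

#[cfg(not(all(target_arch = "x86_64", target_pointer_width = "64")))]
const ASM_FAST_PATH: bool = false;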

Thanks for your benchmark code - I'll take a closer look later.

@twiby (Author) commented Jul 10, 2024

Thank you for the clarification! It did compile fine for x86_64-unknown-linux-gnux32, but with warnings. I've pushed changes to remove the warnings.

Thank you for taking the time to look at this!

@twiby (Author) commented Jul 17, 2024

Have you had time to look at the benchmarks? I'm very curious about the behavior on other machines!

@twiby (Author) commented Jul 20, 2024

For reference, this equivalent experimental implementation (without the asm macro) produces much shorter assembly, but it is less amenable to pipelining, and its performance is very similar to the current master branch. I could not make it more efficient, even by trying, for example, different batch sizes.

All in all, the current PR still has by far the best performance of the approaches I tried.

#[cfg(target_arch = "x86_64")]
cfg_64!(
    /// Performs a part of the addition. Returns a tuple containing the carry state
    /// and the number of integers that were added
    ///
    /// By using as many registers as possible, we treat digits 5 by 5
    unsafe fn schoolbook_add_assign_x86_64(
        lhs: &mut [u64],
        rhs: &[u64],
        mut size: usize,
    ) -> (bool, usize) {
        if size < 5 {
            return (false, 0);
        }
        size -= 5;

        let mut c = 0u8;
        let mut idx = 0;

        while idx < size {
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
        }

        (c > 0, idx)
    }
);

Assembly generated:

schoolbook_add_assign_x86_64:
        cmp     r8, 5
        jb      .LBB1_1
        add     r8, -5
        je      .LBB1_1
        mov     rax, rdx
        mov     edx, 4
        xor     ecx, ecx
.LBB1_4:
        mov     rsi, rdx
        mov     rdx, qword ptr [rax + 8*rdx - 32]
        add     cl, -1
        adc     qword ptr [rdi + 8*rsi - 32], rdx
        mov     rcx, qword ptr [rax + 8*rsi - 24]
        adc     qword ptr [rdi + 8*rsi - 24], rcx
        mov     rcx, qword ptr [rax + 8*rsi - 16]
        adc     qword ptr [rdi + 8*rsi - 16], rcx
        mov     rcx, qword ptr [rax + 8*rsi - 8]
        adc     qword ptr [rdi + 8*rsi - 8], rcx
        mov     rcx, qword ptr [rax + 8*rsi]
        adc     qword ptr [rdi + 8*rsi], rcx
        setb    cl
        lea     rdx, [rsi + 5]
        inc     rsi
        cmp     rsi, r8
        jb      .LBB1_4
        test    cl, cl
        setne   al
        add     rdx, -4
        ret
.LBB1_1:
        xor     edx, edx
        xor     eax, eax
        ret

In particular, I cannot make the assembly use more registers in a single batch, so the additions are not pipelined together.

On the other hand, I think I was wrong in thinking that overwriting the carry flag is a branching issue: it is restored via the add cl, -1 instruction (still a bit expensive), but the penalty is nowhere near that of a true branch.

@twiby (Author) commented Dec 3, 2024

This PR has been open for quite some time; it could be closed if nothing is ever going to happen here. In any case, this proposal is implemented and maintained here, in an experimental repo with much the same features as num-bigint.

@cuviper (Member) commented Dec 4, 2024

AFAICS, we already get the same kind of unrolled optimization as your non-asm version, only it chooses 4 at a time instead of your 5, and when I use -Ctarget-cpu=native it unrolls 8 for me. I also get a mix of better and worse results on my Ryzen 7 5800X compared to the baseline, but I think part of that problem is that the benchmark is using a non-seeded rng, so it's not actually comparing the same thing on every run.

I want to keep this open to investigate further though. There's clearly some performance difference here, but I'm not sure of the true reason behind it.

In particular, I cannot make the assembly use more registers in a single batch, so the additions are not pipelined together.

I'm not sure why we need/want to force more named registers. Shouldn't the CPU's out-of-order execution with register renaming already handle this efficiently? There's a data-dependency chain on the carry flag anyway, but the stores should still work with the old version of the register even while the next one is computed.

@twiby (Author) commented Dec 6, 2024

I understand why the PR is still open.

The benchmark should definitely run on the same integers! The version I sent previously did not yet take that into account.
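For the record, a minimal fix is to seed the generator, along these lines (using rand's StdRng; the helper name and seed parameter are just illustrative):

use rand::distributions::Standard;
use rand::prelude::Distribution;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Deterministic replacement for `gen_n_random_values`: a fixed seed means
/// every benchmark run adds the same operands, so results are comparable.
fn gen_n_seeded_values<T>(n: usize, seed: u64) -> Vec<T>
where
    Standard: Distribution<T>,
{
    let mut rng = StdRng::seed_from_u64(seed);
    (0..n).map(|_| rng.gen::<T>()).collect()
}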

To be honest, I'm not entirely convinced by the explanation for the performance difference either. I guess it must be related to the pipelining of the additions, but I'm not confident enough to be sure of any explanation.
