
slice::sort_by and slice::unstable_sort_by have bad performance when targeting wasm #27

Closed
fitzgen opened this issue Jan 17, 2018 · 14 comments

Comments

@fitzgen
Member

fitzgen commented Jan 17, 2018

...but not when targeting native code.

I am very confused.

My naive Quicksort outperforms them both by a couple of orders of magnitude(!!!!!) regardless of which browser it runs in (which tells me this is a Rust/LLVM issue, not a wasm engine issue). This Quicksort is very naive: it is not pattern-defeating, it does not fall back to insertion sort for small or mostly-sorted ranges, and it does none of the other things the built-in sorts do.

https://github.com/fitzgen/source-map-mappings/blob/master/src/sort.rs

In fact, just replacing the custom swap routine with core's slice::swap causes a huge slowdown. Again, only with wasm, not native code. That makes no sense to me.
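For concreteness, the two styles being compared look roughly like this (a minimal sketch; `custom_swap` is a simplified stand-in for the routine in sort.rs, not a copy of it):

```rust
// Simplified version of the custom swap: split the slice so the two
// mutable references are provably disjoint, then swap through them.
fn custom_swap<T>(slice: &mut [T], x: usize, y: usize) {
    if x == y {
        return;
    }
    let (x, y) = (x.min(y), x.max(y));
    let (low, high) = slice.split_at_mut(y);
    std::mem::swap(&mut low[x], &mut high[0]);
}

fn main() {
    let mut a = [3, 1, 2];
    custom_swap(&mut a, 0, 2);
    assert_eq!(a, [2, 1, 3]);

    // The core routine being compared against is just this:
    let mut b = [3, 1, 2];
    b.swap(0, 2);
    assert_eq!(b, [2, 1, 3]);
}
```

Both do the same thing semantically; the question is why the generated wasm differs so much.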

  • Why is this particular performance gap occurring?

  • In the more general case, how do we debug and profile these cases where perf(wasm) != perf(native) and seemingly irrelevant changes have large effects on performance?

(I guess the first thing to do in this particular case is diff the disassembly and see how the emitted code differs? What other techniques do we have?)

@mgattozzi
Contributor

Maybe @stjepang could shed some light on why? He implemented the faster sorting algorithm in rust-lang/rust#38192.

@Diggsey

Diggsey commented Jan 17, 2018

If it's affecting swap too, then it sounds like a codegen issue: could it be failing to inline functions called with references? Alternatively, JIT compilation could be confusing your benchmarking.

@fitzgen
Member Author

fitzgen commented Jan 17, 2018

I tried to avoid JIT issues by doing 5 warm-up iterations, followed by 100 iterations that actually get collected. There is a yield to the event loop between iterations. Each iteration takes ~500ms, so that should be enough time and opportunity for the optimized wasm to get swapped in.

@pepyakin
Member

Difference in the generated wast:

  1. custom swap version
  2. slice::swap version

One notable difference: the slice::swap version contains a lot of memcpy/memmove calls, while in the custom swap version they appear to be unrolled into loads and stores.

@pepyakin
Member

pepyakin commented Jan 18, 2018

A second thing I've noticed: the slice::swap version has more bounds checks than the custom swap version.

That said, I think it is more likely that the memcpys are responsible for the slowdown.
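To illustrate where those checks come from: plain indexing (and `slice.swap`) carries bounds checks that the optimizer must prove away, while the custom routine uses unchecked accesses after establishing the invariants itself. A sketch of the contrast (hypothetical helper names, not the actual core implementations):

```rust
// Checked: each access is bounds-checked unless the optimizer can
// eliminate the check; if it can't, the check survives into the wasm.
fn swap_checked(slice: &mut [u64], i: usize, j: usize) {
    slice.swap(i, j);
}

// Unchecked: the caller upholds `i < len && j < len`, so release
// builds emit no bounds checks here. `ptr::swap` is fine even if i == j.
fn swap_unchecked(slice: &mut [u64], i: usize, j: usize) {
    debug_assert!(i < slice.len() && j < slice.len());
    unsafe {
        std::ptr::swap(
            slice.as_mut_ptr().add(i),
            slice.as_mut_ptr().add(j),
        );
    }
}

fn main() {
    let mut a = [1u64, 2, 3];
    swap_checked(&mut a, 0, 2);
    swap_unchecked(&mut a, 0, 2);
    assert_eq!(a, [1, 2, 3]); // two swaps of the same pair restore the array
}
```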

@ghost

ghost commented Jan 18, 2018

Ok, I'm still very new to wasm (I must admit, running Rust in the browser is fun! :)). In fact, this is the first time I've ever compiled something to wasm. Here's my setup.

hello.rs
use std::mem;
use std::cmp::{self, Ordering};

#[inline(always)]
fn swap<T>(slice: &mut [T], x: usize, y: usize) {
    debug_assert!(x < slice.len(), "(x = {}) < (slice.len() = {})", x, slice.len());
    debug_assert!(y < slice.len(), "(y = {}) < (slice.len() = {})", y, slice.len());

    if x == y {
        return;
    }

    let (x, y) = (cmp::min(x, y), cmp::max(x, y));
    let (low, high) = slice.split_at_mut(y);

    debug_assert!(x < low.len());
    debug_assert!(0 < high.len());

    unsafe {
        mem::swap(low.get_unchecked_mut(x), high.get_unchecked_mut(0));
    }
}

/// Partition the `slice[p..r]` about some pivot element in that range, and
/// return the index of the pivot.
#[inline(always)]
fn partition<T: Ord>(slice: &mut [T], p: usize, r: usize) -> usize {
    let pivot = (p + r) / 2;
    swap(slice, pivot, r);

    let mut i = (p as isize) - 1;

    for j in p..r {
        if let Ordering::Greater = unsafe {
            debug_assert!(j < slice.len());
            debug_assert!(r < slice.len());
            Ord::cmp(slice.get_unchecked(j), slice.get_unchecked(r))
        } {
            continue;
        }

        i += 1;
        swap(slice, i as usize, j);
    }

    swap(slice, (i + 1) as usize, r);
    return (i + 1) as usize;
}

/// Recursive quick sort implementation with all the extra parameters that we
/// want to hide from callers to give them better ergonomics.
fn do_quick_sort<T: Ord>(slice: &mut [T], p: usize, r: usize) {
    if p < r {
        let q = partition(slice, p, r);
        do_quick_sort(slice, p, q.saturating_sub(1));
        do_quick_sort(slice, q + 1, r);
    }
}

/// Do a quick sort on the given slice.
pub fn quick_sort<T: Ord>(slice: &mut [T]) {
    if slice.is_empty() {
        return;
    }

    let len = slice.len();
    do_quick_sort(slice, 0, len - 1);
}

#[no_mangle]
pub extern fn hello() {
    let mut v = (0u64..1_000_000).map(|x| x * x * x * 18913515181).collect::<Vec<_>>();
    v.sort(); // 200 ms
    // v.sort_unstable(); // 70 ms
    // quick_sort(&mut v); // 150 ms
}

fn main() {}
hello.html
<html>
  <body>
    <script>
      fetch("hello.wasm")
        .then(response => response.arrayBuffer())
        .then(bytes => WebAssembly.instantiate(bytes, {}))
        .then(results => results.instance)
        .then(mod => {
          var start = new Date();
          mod.exports.hello();
          console.log(new Date() - start);
        });
    </script>
  </body>
</html>

Compile: rustc --target wasm32-unknown-unknown hello.rs -O
Run: firefox hello.html (and look at the console)

Results for sorting 1M random u64s:

  • sort: 200 ms
  • sort_unstable: 70 ms
  • quick_sort: 150 ms

And here's the same benchmark with compilation to native:

hello.rs
(identical to the wasm hello.rs above; only `main` differs:)

fn main() {
    let mut v = (0u64..1_000_000).map(|x| x * x * x * 18913515181).collect::<Vec<_>>();
    v.sort(); // 70 ms
    // v.sort_unstable(); // 38 ms
    // quick_sort(&mut v); // 100 ms
}

Compile: rustc hello.rs -O
Run: time ./hello

Results:

  • sort: 70 ms
  • sort_unstable: 38 ms
  • quick_sort: 100 ms

These numbers look completely reasonable. Am I doing something wrong? :)

@fitzgen
Member Author

fitzgen commented Jan 19, 2018

Those numbers do look reasonable. However, when I replace calls to my quick_sort with .sort_by or .sort_unstable_by, I don't get reasonable numbers like that.

Perhaps it is related to the size of the data being sorted and to the comparator functions used to sort it? https://github.com/fitzgen/source-map-mappings/blob/master/src/comparators.rs#L49-L91
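For context, those comparators are multi-key lexicographic comparisons over a mapping struct. A hypothetical sketch of the shape of such a comparator (field names are illustrative, not the actual source-map-mappings types):

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for a source-map mapping entry; the real
// comparators compare several fields in sequence like this.
#[derive(Clone, Copy)]
struct Mapping {
    generated_line: u32,
    generated_column: u32,
    original_line: u32,
}

fn by_generated(a: &Mapping, b: &Mapping) -> Ordering {
    a.generated_line
        .cmp(&b.generated_line)
        .then(a.generated_column.cmp(&b.generated_column))
        .then(a.original_line.cmp(&b.original_line))
}

fn main() {
    let mut v = vec![
        Mapping { generated_line: 2, generated_column: 0, original_line: 0 },
        Mapping { generated_line: 1, generated_column: 5, original_line: 0 },
        Mapping { generated_line: 1, generated_column: 3, original_line: 0 },
    ];
    v.sort_unstable_by(by_generated);
    assert_eq!(v[0].generated_column, 3);
    assert_eq!(v[2].generated_line, 2);
}
```

Structs this size (several u32 fields) are larger than a single word, which may interact with the memcpy behavior noted above.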

The differences in codegen that @pepyakin found are also interesting, and seem promising for understanding better what's going on here.

@pepyakin
Member

I had another brief look at the problem today: I took hello.rs by @stjepang, converted it to no_std, and the memcpys from slice::swap were gone 👻

@fitzgen
Member Author

fitzgen commented Feb 22, 2018

I just retested on the latest nightly, and it seems the problem has gone away. sort_unstable_by is now much faster than my naive quick sort 🎉

@fitzgen fitzgen closed this as completed Feb 22, 2018
@mgattozzi
Contributor

Nice! I wouldn't be surprised if LLD support helped it out.

@pepyakin
Member

I thought LLD hadn't landed yet?

@mgattozzi
Contributor

Whoops, I thought it had for some reason.

@fitzgen
Member Author

fitzgen commented Feb 22, 2018

LLVM 6 was merged, and it might have to do with that.

/me shrugs

@mgattozzi
Contributor

The mysteries of compilers: no one is ever truly sure why things get fixed; they just do.
