-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
slice::sort_by and slice::unstable_sort_by have bad performance when targeting wasm #27
Comments
Maybe @stjepang could shed some possible light as to why? He did implement a faster sorting algorithm in rust-lang/rust#38192. |
If it's affecting |
I tried to avoid JIT issues by doing 5 warm up iterations, followed by 100 iterations that actually get collected. There is a yield to the event loop in between each iteration. Each iteration takes ~500ms, so it should be enough time and opportunity for the optimized wasm to get swapped in. |
Difference in generated wast: One notable difference: in slice::swap version there is a lot of memcpy/memmoves, in the same time in custom swap version they seem to be unrolled to loads and stores. |
Second thing that I've noticed, in slice::swap there are more bounds checks than in custom swap. Although, I think it is more probable that these memcpys are responsible for the slowdown. |
Ok, I'm still very new to wasm (must admit - running Rust in the browser is fun! :)). In fact, this is the first time ever I've compiled something to wasm. Here's my setup. hello.rsuse std::mem;
use std::cmp::{self, Ordering};
#[inline(always)]
fn swap<T>(slice: &mut [T], x: usize, y: usize) {
debug_assert!(x < slice.len(), "(x = {}) < (slice.len() = {})", x, slice.len());
debug_assert!(y < slice.len(), "(y = {}) < (slice.len() = {})", y, slice.len());
if x == y {
return;
}
let (x, y) = (cmp::min(x, y), cmp::max(x, y));
let (low, high) = slice.split_at_mut(y);
debug_assert!(x < low.len());
debug_assert!(0 < high.len());
unsafe {
mem::swap(low.get_unchecked_mut(x), high.get_unchecked_mut(0));
}
}
/// Partition the `slice[p..r]` about some pivot element in that range, and
/// return the index of the pivot.
#[inline(always)]
fn partition<T: Ord>(slice: &mut [T], p: usize, r: usize) -> usize {
let pivot = (p + r) / 2;
swap(slice, pivot, r);
let mut i = (p as isize) - 1;
for j in p..r {
if let Ordering::Greater = unsafe {
debug_assert!(j < slice.len());
debug_assert!(r < slice.len());
Ord::cmp(slice.get_unchecked(j), slice.get_unchecked(r))
} {
continue;
}
i += 1;
swap(slice, i as usize, j);
}
swap(slice, (i + 1) as usize, r);
return (i + 1) as usize;
}
/// Recursive quick sort implementation with all the extra parameters that we
/// want to hide from callers to give them better ergonomics.
fn do_quick_sort<T: Ord>(slice: &mut [T], p: usize, r: usize) {
if p < r {
let q = partition(slice, p, r);
do_quick_sort(slice, p, q.saturating_sub(1));
do_quick_sort(slice, q + 1, r);
}
}
/// Do a quick sort on the given slice.
pub fn quick_sort<T: Ord>(slice: &mut [T]) {
if slice.is_empty() {
return;
}
let len = slice.len();
do_quick_sort(slice, 0, len - 1);
}
#[no_mangle]
pub extern fn hello() {
let mut v = (0u64..1_000_000).map(|x| x * x * x * 18913515181).collect::<Vec<_>>();
v.sort(); // 200 ms
// v.sort_unstable(); // 70 ms
// quick_sort(&mut v); // 150 ms
}
fn main() {} hello.html<html>
<body>
<script>
fetch("hello.wasm")
.then(response => response.arrayBuffer())
.then(bytes => WebAssembly.instantiate(bytes, {}))
.then(results => results.instance)
.then(mod => {
var start = new Date();
mod.exports.hello();
console.log(new Date() - start);
});
</script>
</body>
</html> Compile: Results for sorting 1M random
And here's the same benchmark with compilation to native: hello.rsuse std::mem;
use std::cmp::{self, Ordering};
#[inline(always)]
fn swap<T>(slice: &mut [T], x: usize, y: usize) {
debug_assert!(x < slice.len(), "(x = {}) < (slice.len() = {})", x, slice.len());
debug_assert!(y < slice.len(), "(y = {}) < (slice.len() = {})", y, slice.len());
if x == y {
return;
}
let (x, y) = (cmp::min(x, y), cmp::max(x, y));
let (low, high) = slice.split_at_mut(y);
debug_assert!(x < low.len());
debug_assert!(0 < high.len());
unsafe {
mem::swap(low.get_unchecked_mut(x), high.get_unchecked_mut(0));
}
}
/// Partition the `slice[p..r]` about some pivot element in that range, and
/// return the index of the pivot.
#[inline(always)]
fn partition<T: Ord>(slice: &mut [T], p: usize, r: usize) -> usize {
let pivot = (p + r) / 2;
swap(slice, pivot, r);
let mut i = (p as isize) - 1;
for j in p..r {
if let Ordering::Greater = unsafe {
debug_assert!(j < slice.len());
debug_assert!(r < slice.len());
Ord::cmp(slice.get_unchecked(j), slice.get_unchecked(r))
} {
continue;
}
i += 1;
swap(slice, i as usize, j);
}
swap(slice, (i + 1) as usize, r);
return (i + 1) as usize;
}
/// Recursive quick sort implementation with all the extra parameters that we
/// want to hide from callers to give them better ergonomics.
fn do_quick_sort<T: Ord>(slice: &mut [T], p: usize, r: usize) {
if p < r {
let q = partition(slice, p, r);
do_quick_sort(slice, p, q.saturating_sub(1));
do_quick_sort(slice, q + 1, r);
}
}
/// Do a quick sort on the given slice.
pub fn quick_sort<T: Ord>(slice: &mut [T]) {
if slice.is_empty() {
return;
}
let len = slice.len();
do_quick_sort(slice, 0, len - 1);
}
fn main() {
let mut v = (0u64..1_000_000).map(|x| x * x * x * 18913515181).collect::<Vec<_>>();
v.sort(); // 70 ms
// v.sort_unstable(); // 38 ms
// quick_sort(&mut v); // 100 ms
} Compile: Results:
These numbers look completely reasonable. Am I doing something wrong? :) |
Those numbers do look reasonable, however when I replace calls to my Perhaps it is related to the size of the data being sorted and the comparator functions that are sorting them? https://github.com/fitzgen/source-map-mappings/blob/master/src/comparators.rs#L49-L91 The differences in codegen that @pepyakin found are also interesting, and seem promising for understanding better what's going on here. |
I had a brief look at the problem today again: I took |
I just retested on the latest nightly, and it seems the problem has gone away. |
Nice! I wouldn't be surprised if lld support helped it out. |
Woops thought it was for some reason. |
LLVM6 was merged, and it might have to do with that. /me shrugs |
The mysteries of compilers and how no one is truly sure why things get fixed, they just do. |
...but not when targeting native code.
I am very confused.
My naive Quicksort outperforms them both by a couple orders of magnitude(!!!!!) regardless which browser it is executed in (which tells me this is a Rust/LLVM issue, not wasm engine issue). This Quicksort is very naive: it is not pattern defeating, nor optimistically using insertion sort for small and mostly sorted ranges, or any of the other things that the built-in sort is doing.
https://github.com/fitzgen/source-map-mappings/blob/master/src/sort.rs
In fact, just replacing the custom
swap
routine withcore
'sslice::swap
causes a huge slow down. Again, only with wasm, not native code. That makes no sense to me.Why is this particular performance gap occurring?
In the more general case, how do we debug and profile these cases where
perf(wasm) != perf(native)
and seemingly irrelevant changes have large effects on performance?(I guess the first thing to do in this particular case is diff the disassembly and see how the emitted code differs? What other techniques do we have?)
The text was updated successfully, but these errors were encountered: