ascii.rs optimization since ptr is aligned #193

danakj · 2024-10-23T20:31:10Z

On line 186, there is a call to _mm_loadu_si128:

Line 186 in 3dc5939

let chunk = _mm_loadu_si128(ptr as *const __m128i);

That call does not require the pointer to be aligned at all. But it could be replaced with _mm_load_si128, which requires alignment to a 16-byte boundary.

On line 139, the ptr is aligned to the VECTOR_ALIGN mask, which aligns it to size_of::<__m128i>(), or to 16 bytes.

bstr/src/ascii.rs

Lines 120 to 121 in 3dc5939

const VECTOR_SIZE: usize = core::mem::size_of::<__m128i>();
const VECTOR_ALIGN: usize = VECTOR_SIZE - 1;

https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L139C9-L139C12
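To make the mask arithmetic concrete, here is a minimal sketch of how a VECTOR_SIZE - 1 mask can round an address up to the next 16-byte boundary. The helper names are hypothetical and this is not the code from ascii.rs; VECTOR_SIZE is hard-coded to 16 on the assumption that it matches size_of::<__m128i>().

```rust
// Assumption: 16 equals size_of::<__m128i>() on x86-64.
const VECTOR_SIZE: usize = 16;
const VECTOR_ALIGN: usize = VECTOR_SIZE - 1;

/// Hypothetical helper: how many bytes to advance from `addr` to land on
/// the next 16-byte boundary strictly after `addr`. An already aligned
/// address advances by a full vector (16 bytes).
fn bytes_to_next_boundary(addr: usize) -> usize {
    VECTOR_SIZE - (addr & VECTOR_ALIGN)
}

/// Hypothetical helper: true when the low four address bits are all zero.
fn is_vector_aligned(addr: usize) -> bool {
    addr & VECTOR_ALIGN == 0
}
```

Once an address is aligned this way, adding any multiple of VECTOR_SIZE (16 or 4 * 16 bytes) keeps the low four bits at zero, which is the property the rest of this argument relies on.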

The ptr is always advanced by VECTOR_LOOP_SIZE in a loop, which is a multiple of 16 bytes:

bstr/src/ascii.rs

Lines 120 to 122 in 3dc5939

const VECTOR_SIZE: usize = core::mem::size_of::<__m128i>();
const VECTOR_ALIGN: usize = VECTOR_SIZE - 1;
const VECTOR_LOOP_SIZE: usize = 4 * VECTOR_SIZE;

https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L180C36-L180C52

And then further advanced by VECTOR_SIZE which is 16 bytes:

bstr/src/ascii.rs

Line 120 in 3dc5939

const VECTOR_SIZE: usize = core::mem::size_of::<__m128i>();

bstr/src/ascii.rs

Line 191 in 3dc5939

ptr = ptr.add(VECTOR_SIZE);

So in the loop at L186, the pointer is always aligned to 16 bytes:

bstr/src/ascii.rs

Line 186 in 3dc5939

let chunk = _mm_loadu_si128(ptr as *const __m128i);
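As a rough illustration of the proposed swap, the sketch below loads the same 16-byte-aligned buffer with both intrinsics and returns the two movemask results, which should agree. The Aligned wrapper and the non_ascii_masks function are hypothetical names for this example only, and the scalar fallback for non-x86-64 targets is my addition so the snippet builds everywhere.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_load_si128, _mm_loadu_si128, _mm_movemask_epi8};

/// Hypothetical wrapper that forces 16-byte alignment of the buffer.
#[repr(align(16))]
struct Aligned([u8; 32]);

/// Returns the high-bit masks of the first 16 bytes, computed once with
/// the aligned load and once with the unaligned load.
#[cfg(target_arch = "x86_64")]
fn non_ascii_masks(buf: &Aligned) -> (i32, i32) {
    let ptr = buf.0.as_ptr();
    // _mm_load_si128 requires this; repr(align(16)) guarantees it here.
    debug_assert_eq!(ptr as usize % 16, 0);
    // SAFETY: SSE2 is part of the x86-64 baseline, `ptr` is 16-byte
    // aligned, and at least 16 bytes are readable behind it.
    unsafe {
        let aligned = _mm_load_si128(ptr as *const __m128i);
        let unaligned = _mm_loadu_si128(ptr as *const __m128i);
        (_mm_movemask_epi8(aligned), _mm_movemask_epi8(unaligned))
    }
}

/// Scalar stand-in for other targets (an assumption of this sketch):
/// bit i of the mask is set when byte i has its high bit set.
#[cfg(not(target_arch = "x86_64"))]
fn non_ascii_masks(buf: &Aligned) -> (i32, i32) {
    let mut mask = 0i32;
    for (i, b) in buf.0[..16].iter().enumerate() {
        if b & 0x80 != 0 {
            mask |= 1 << i;
        }
    }
    (mask, mask)
}
```

Both loads read the same bytes; the only difference is that _mm_load_si128 has the alignment precondition, so the change is sound only as long as the alignment invariant described above holds.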

The aligned version of the function was pointed out by @anforowicz during unsafe code audit: https://chromium-review.googlesource.com/c/chromium/src/+/5925797/comment/f08dc00c_1b24061c/


BurntSushi · 2024-10-23T23:22:32Z

Hmmm I think actually there's probably a bug here. We probably should be making use of unaligned loads here to avoid the slow byte-at-a-time loop. Like this:

https://github.com/BurntSushi/memchr/blob/a26dd5b590e0a0ddcce454fb1ac02f5586e50952/src/arch/generic/memchr.rs#L219-L228

(Although that interestingly does use an unaligned load just before when it could be an aligned load.)
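For readers unfamiliar with the trick, here is a minimal sketch of the overlapping-tail idea: after the full chunks are processed, one final 16-byte read ending exactly at the end of the input replaces the byte-at-a-time loop. To keep it safe and portable, this sketch swaps the SSE2 vector for a plain u128 word; the function names are hypothetical and this is not the memchr or bstr implementation.

```rust
use std::convert::TryInto;

const CHUNK: usize = 16;
// One high bit per byte; a set bit marks a non-ASCII byte.
const HIGH_BITS: u128 = 0x8080_8080_8080_8080_8080_8080_8080_8080;

/// Index of the first non-ASCII byte in a 16-byte chunk, if any.
/// Returns None for ASCII-only chunks (and for slices that are not
/// exactly 16 bytes long, which callers here never pass).
fn non_ascii_in_chunk(chunk: &[u8]) -> Option<usize> {
    let word = u128::from_le_bytes(chunk.try_into().ok()?);
    let hits = word & HIGH_BITS;
    if hits == 0 {
        None
    } else {
        // Little-endian: byte j occupies bits 8j..8j+7.
        Some(hits.trailing_zeros() as usize / 8)
    }
}

/// Index of the first non-ASCII byte in `bytes`, if any.
fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
    if bytes.len() < CHUNK {
        // Too short for even one chunk: scan byte by byte.
        return bytes.iter().position(|b| !b.is_ascii());
    }
    let mut i = 0;
    while i + CHUNK <= bytes.len() {
        if let Some(j) = non_ascii_in_chunk(&bytes[i..i + CHUNK]) {
            return Some(i + j);
        }
        i += CHUNK;
    }
    // Tail: one read ending at the last byte. It may overlap bytes that
    // were already checked, which is harmless since they are known ASCII.
    let start = bytes.len() - CHUNK;
    non_ascii_in_chunk(&bytes[start..]).map(|j| start + j)
}
```

The overlapping read generally starts at an address that is not a multiple of 16, which is why this approach needs the unaligned load in a SIMD version.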

Aligned versus unaligned rarely makes a difference on x86-64 in my experience anyway, and especially here where this is just the "cleanup" aspect of the implementation.

But good catch. I agree that as the code is currently written, it can be safely an aligned load.
