-
Notifications
You must be signed in to change notification settings - Fork 32
Bitset optimisations
This wikipage lays out our optimisation attempts to improve the performance of bitset.c. To not require portability between platforms we may improve performance on specific platforms. Alternatively some code may have been optimised in upstream libraries. Here we compare our code.
We use bitset_clear to set all bits to zero after allocation, hence calloc may have been called already and precious operations took place. Overshooting the requested number of bit doesn't hurt here, we never count the number of zero bits in our bitset.
The naive implementation of bitset_clear iterates over all chunks used for allocation. An initial optimisation has been applied to use a decreasing while-loop instead of an increasing for-loop.
void bitset_clear(bitset_t *self) {
uint32_t i_chunk = self->n_chunks;
while (i_chunk) {
i_chunk--;
self->chunks[i_chunk] = (bits_t) 0;
}
}
An alternative approaches are the now deprecrated void bzero(void *s, size_t n);
and void *memset(void *s, int c, size_t n);
from strings.h.
void bitset_clear(bitset_t *self) {
bzero (self->chunks, sizeof (bits_t) * self->n_chunks);
}
void bitset_clear(bitset_t *self) {
memset (self->chunks, 0, sizeof (bits_t) * self->n_chunks);
}
#include "bitset.h"
int main() {
uint64_t i = 10000;
bitset_t *bitset = bitset_new(100000000);
while (i) {
i--;
bitset_clear(bitset);
}
}
gcc -O2 -o bench-naive benchmark.c bitset.c
gcc -O2 -o bench-bzero benchmark.c bitset-bzero.c
gcc -O2 -o bench-memset benchmark.c bitset-memset.c
operation real user sys
bench-naive 0m23.653s 0m23.625s 0m0.002s
bench-bzero 0m13.596s 0m13.572s 0m0.008s
bench-memset 0m13.586s 0m13.566s 0m0.004s
We use bitset_get to get a single bit from a bitstring which is split over many different chunks.
The naive implementation of bitset_get finds the correct memory chunk (index >> BS_SHIFT
) and applies a mask for the nth-bit in this memory block.
bool bitset_get(bitset_t *self, uint32_t index) {
return self->chunks[index >> BS_SHIFT] & ((bits_t) 1) << (index & (BS_BITS - 1));
}
The SSE4.1 processor extension allow for efficient processing of 128bit values. For this benchmark we first test an implementation using uint64_t values. SSE4.1 doesn't allow to leftshift an arbitrary number of bits, it shifts by 8. For this reason we naively compute the mask first.
bool bitset_get(bitset_t *self, uint32_t index) {
bits_t _mask = ((bits_t) 1) << (index & (BS_BITS - 1));
__m128i x = _mm_setr_epi64 ((__m64) (uint64_t) 0, (__m64) self->chunks[index >> BS_SHIFT]);
__m128i mask = _mm_setr_epi64 ((__m64) (uint64_t) 0, (__m64) _mask);
__m128i t = _mm_and_si128 (x, mask);
return !_mm_testz_si128 (t,t);
}
Our second implementation should test 128bit values. The performance is so poor we wonder if someone has a clue what might have gone wrong.
bool bitset_get(bitset_t *self, uint32_t index) {
bits_t _mask = ((bits_t) 1) << (index & (BS_BITS - 1));
__m128i x = _mm_loadu_si128 ((__m128i *) &self->chunks[index >> BS_SHIFT]);
__m128i mask = _mm_loadu_si128 ((__m128i *) &_mask);
__m128i t = _mm_and_si128 (x, mask);
return !_mm_testz_si128 (t, t);
}
Other methods exist to use instead of left shifting to the mask value. An option for example is to use a lookup table.
#include "bitset.h"
int main() {
uint32_t i = 100;
bitset_t *bitset = bitset_new(100000000);
bitset_clear(bitset);
while (i) {
uint32_t j = 100000000;
i--;
while (j) {
j--;
bitset_get (bitset, j);
}
}
}
gcc -O2 -march=native -DRRRR_BITSET_32 -o bench-naive-32 benchmark.c bitset-memset.c
gcc -O2 -march=native -DRRRR_BITSET_64 -o bench-naive-64 benchmark.c bitset-memset.c
gcc -O2 -march=native -DRRRR_BITSET_64 -o bench-sse41-64 benchmark.c bitset-memset-sse41.c
gcc -O2 -march=native -DRRRR_BITSET_128 -o bench-naive-128 benchmark.c bitset-memset.c
gcc -O2 -march=native -DRRRR_BITSET_128 -o bench-sse41-128 benchmark.c bitset-memset-sse41-128.c
operation real user sys
bench-naive-32 0m24.002s 0m23.987s 0m0.004s
bench-naive-64 0m24.160s 0m24.148s 0m0.001s
bench-sse41-64 0m26.897s 0m26.879s 0m0.006s
bench-naive-128 0m29.302s 0m29.284s 0m0.005s
bench-sse41-128 1m29.741s 1m29.700s 0m0.006s