Bitset optimisations

This wikipage lays out our optimisation attempts to improve the performance of bitset.c. To not require portability between platforms we may improve performance on specific platforms. Alternatively some code may have been optimised in upstream libraries. Here we compare our code.

bitset_clear

We use bitset_clear to set all bits to zero after allocation, hence calloc may have been called already and precious operations took place. Overshooting the requested number of bit doesn't hurt here, we never count the number of zero bits in our bitset.

Naive implementation

The naive implementation of bitset_clear iterates over all chunks used for allocation. An initial optimisation has been applied to use a decreasing while-loop instead of an increasing for-loop.

void bitset_clear(bitset_t *self) {
    uint32_t i_chunk = self->n_chunks;

    while (i_chunk) {
        i_chunk--;
        self->chunks[i_chunk] = (bits_t) 0;
    }
}

Alternative implementations

An alternative approaches are the now deprecrated void bzero(void *s, size_t n); and void *memset(void *s, int c, size_t n); from strings.h.

void bitset_clear(bitset_t *self) {
    bzero (self->chunks, sizeof (bits_t) * self->n_chunks);
}

void bitset_clear(bitset_t *self) {
    memset (self->chunks, 0, sizeof (bits_t) * self->n_chunks);
}

Benchmark

#include "bitset.h"

int main() {
    uint64_t i = 10000;
    bitset_t *bitset = bitset_new(100000000);
    while (i) {
        i--;
        bitset_clear(bitset);
    }
}

gcc -O2 -o bench-naive benchmark.c bitset.c
gcc -O2 -o bench-bzero benchmark.c bitset-bzero.c
gcc -O2 -o bench-memset benchmark.c bitset-memset.c

operation        real       user       sys
bench-naive      0m23.653s  0m23.625s  0m0.002s
bench-bzero      0m13.596s  0m13.572s  0m0.008s
bench-memset     0m13.586s  0m13.566s  0m0.004s

bitset_get

We use bitset_get to get a single bit from a bitstring which is split over many different chunks.

Naive implementation

The naive implementation of bitset_get finds the correct memory chunk (index >> BS_SHIFT) and applies a mask for the nth-bit in this memory block.

bool bitset_get(bitset_t *self, uint32_t index) {
    return self->chunks[index >> BS_SHIFT] & ((bits_t) 1) << (index & (BS_BITS - 1));
}

SSE4.1 implementation

The SSE4.1 processor extension allow for efficient processing of 128bit values. For this benchmark we first test an implementation using uint64_t values. SSE4.1 doesn't allow to leftshift an arbitrary number of bits, it shifts by 8. For this reason we naively compute the mask first.

bool bitset_get(bitset_t *self, uint32_t index) {
    bits_t _mask = ((bits_t) 1) << (index & (BS_BITS - 1));

    __m128i x = _mm_setr_epi64 ((__m64) (uint64_t) 0, (__m64) self->chunks[index >> BS_SHIFT]);
    __m128i mask = _mm_setr_epi64 ((__m64) (uint64_t) 0, (__m64) _mask);
    __m128i t = _mm_and_si128 (x, mask);

    return !_mm_testz_si128 (t,t);
}

Our second implementation should test 128bit values. The performance is so poor we wonder if someone has a clue what might have gone wrong.

bool bitset_get(bitset_t *self, uint32_t index) {
    bits_t _mask = ((bits_t) 1) << (index & (BS_BITS - 1));

    __m128i x = _mm_loadu_si128 ((__m128i *) &self->chunks[index >> BS_SHIFT]);
    __m128i mask = _mm_loadu_si128 ((__m128i *) &_mask);
    __m128i t = _mm_and_si128 (x, mask);

    return !_mm_testz_si128 (t, t);
}

Alternative implementations

Other methods exist to use instead of left shifting to the mask value. An option for example is to use a lookup table.

Benchmark

#include "bitset.h"

int main() {
    uint32_t i = 100;
    bitset_t *bitset = bitset_new(100000000);
    bitset_clear(bitset);

    while (i) {
        uint32_t j = 100000000;

        i--;
        while (j) {
            j--;
            bitset_get (bitset, j);
        }
    }
}

gcc -O2 -march=native -DRRRR_BITSET_32  -o bench-naive-32  benchmark.c bitset-memset.c
gcc -O2 -march=native -DRRRR_BITSET_64  -o bench-naive-64  benchmark.c bitset-memset.c
gcc -O2 -march=native -DRRRR_BITSET_64  -o bench-sse41-64  benchmark.c bitset-memset-sse41.c
gcc -O2 -march=native -DRRRR_BITSET_128 -o bench-naive-128 benchmark.c bitset-memset.c
gcc -O2 -march=native -DRRRR_BITSET_128 -o bench-sse41-128 benchmark.c bitset-memset-sse41-128.c

operation        real       user       sys
bench-naive-32   0m24.002s  0m23.987s  0m0.004s
bench-naive-64   0m24.160s  0m24.148s  0m0.001s
bench-sse41-64   0m26.897s  0m26.879s  0m0.006s
bench-naive-128  0m29.302s  0m29.284s  0m0.005s
bench-sse41-128  1m29.741s  1m29.700s  0m0.006s

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bitset optimisations

Bitset optimisations

bitset_clear

Naive implementation

Alternative implementations

Benchmark

bitset_get

Naive implementation

SSE4.1 implementation

Alternative implementations

Benchmark

Clone this wiki locally