Hashtree is a SHA256 library highly optimized for Merkle tree computation. It is based on Intel's implementation with a few modifications like hardcoding the scheduled words of the padding block.
The library exposes a single header file with the low level sha functions. They all have the following signature:
void sha256(unsigned char *output, unsigned char *input, uint64_t count)
where input
is the buffer containing count
chunks of 64 bytes to be hashed,
and output
is a pre-allocated buffer that will hold the count
32-byte digests.
These are low level functions that do not perform any error check: the caller is
responsible for checking that input
is at least 64*count
bytes long and that
output
is writable and at least 32*count
bytes long. The caller is
responsible for memory allocation and de-allocation of these buffers.
There are no dependencies besides the standard C
header stdint.h
. Benchmarks
have a dependency on libm
. Tests and benchmarks on x86-64 an extra
dependency on cpuid.h
is needed. An optional dependency on openssl allows to
test and benchmark against openssl. The only build-time dependency is a GCC and
GNU assembler compatible compiler like gcc
and gas
. On Mac OS X with newer Apple Silicon processors the library can be built with the default clang compiler.
- Start by cloning the repository
$ git clone https://github.com/potuz/hashtree.git
$ cd hashtree
- To build the library
$ make
this produces a statically linked library libhashtree.a
in the src
directory.
- To build tests or benchmarks or all respectively:
$ make test
$ make bench
$ make all
- To test or benchmark against OPENSSL:
$ make clean
$ HAVE_OPENSSL=1 make all
- To cross compile for ARMv8
$ CC=aarch64-linux-gnu-gcc make
- To cross compile for Windows (benchmarks will not work)
$ make clean
$ CC=x86_64-w64-mingw32-gcc make test
$ make test
$ ./src/test
Test hash_sse_1... [ OK ]
Test hash_sse_1_multiple_blocks... [ OK ]
Test hash_avx_1... [ OK ]
Test hash_avx_1_multiple_blocks... [ OK ]
Test hash_avx_4... [ OK ]
Test hash_avx_4_6blocks... [ OK ]
Test hash_avx_8... [ OK ]
Test hash_avx_8_13blocks... [ OK ]
Test hash_shani... [ CPU does not support SHA-ni ]
Test hash_shani_13blocks... [ CPU does not support SHA-ni ]
Test hash_avx_16... [ OK ]
Test hash_avx_16_30blocks... [ OK ]
This is running in a CPU that does not support SHA extensions, that is why two tests fail. Your system may output a different combination.
To run benchmarks:
$ HAVE_OPENSSL=1 make all
$ ./src/bench
[==========] Running 9 benchmarks.
[ RUN ] sse.sse_x1_one_at_time
[ OK ] sse.sse_x1_one_at_time (mean 32.300ms, confidence interval +- 1.725613%)
[ RUN ] sse.sse_x1
[ OK ] sse.sse_x1 (mean 31.837ms, confidence interval +- 0.327542%)
[ RUN ] avx.avx_x1_one_at_time
[ OK ] avx.avx_x1_one_at_time (mean 32.299ms, confidence interval +- 0.464452%)
[ RUN ] avx.avx_x1
[ OK ] avx.avx_x1 (mean 31.855ms, confidence interval +- 0.150833%)
[ RUN ] avx.avx_x4
[ OK ] avx.avx_x4 (mean 12.368ms, confidence interval +- 1.758262%)
[ RUN ] avx.avx_x8
[ OK ] avx.avx_x8 (mean 6.519ms, confidence interval +- 2.142718%)
[ RUN ] shani.shani
[ OK ] shani.shani (mean -20.-559us, confidence interval +- -89188758235664.375000%)
[ RUN ] shani.shani_one_at_time
[ OK ] shani.shani_one_at_time (mean -3281090326183.-673us, confidence interval +- -9846.737643%)
[ RUN ] openssl.openssl_one_at_time
[ OK ] openssl.openssl_one_at_time (mean 30.519ms, confidence interval +- 0.330545%)
[==========] 9 benchmarks ran.
[ PASSED ] 9 benchmarks.
The results for the SHA-ni benchmarks are spurious since the CPU does not support them. Here we see that the benchmark against openssl native implementation for this CPU runs at the same speed as the single buffer SSE and AVX implementation, while the AVX2 implementation 8 blocks at a time runs 5 times faster.
A benchmark on a cascade-lake supporting AVX-512:
./src/bench
[==========] Running 9 benchmarks.
[ RUN ] sse.sse_x1_one_at_time
[ OK ] sse.sse_x1_one_at_time (mean 29.182ms, confidence interval +- 0.149473%)
[ RUN ] sse.sse_x1
[ OK ] sse.sse_x1 (mean 28.833ms, confidence interval +- 0.074605%)
[ RUN ] avx.avx_x1_one_at_time
[ OK ] avx.avx_x1_one_at_time (mean 29.205ms, confidence interval +- 0.138581%)
[ RUN ] avx.avx_x1
[ OK ] avx.avx_x1 (mean 28.871ms, confidence interval +- 0.200034%)
[ RUN ] avx.avx_x4
[ OK ] avx.avx_x4 (mean 11.078ms, confidence interval +- 0.140484%)
[ RUN ] avx.avx_x8
[ OK ] avx.avx_x8 (mean 5.650ms, confidence interval +- 0.118668%)
[ RUN ] avx.avx_x16
[ OK ] avx.avx_x16 (mean 2.413ms, confidence interval +- 0.223049%)
[ RUN ] shani.shani
[ OK ] shani.shani (mean -4.-941us, confidence interval +- -1102817647134393.500000%)
[ RUN ] shani.shani_one_at_time
[ OK ] shani.shani_one_at_time (mean 0.-140us, confidence interval +- -70078699044457400.000000%)
[==========] 9 benchmarks ran.
[ PASSED ] 9 benchmarks.
We see that AVX-512 (x16) runs 12 times faster than a single block implementation. This is slightly better than a native SHA extension CPU were gains were about x10.
A similar benchmark on a Raspberry Pi 4 model B:
$ ./src/bench
[==========] Running 3 benchmarks.
[ RUN ] armv8.neon_x1_one_at_time
[ OK ] armv8.neon_x1_one_at_time (mean 79.853ms, confidence interval +- 0.157599%)
[ RUN ] armv8.neon_x1
[ OK ] armv8.neon_x1 (mean 79.035ms, confidence interval +- 0.070254%)
[ RUN ] armv8.neon_x4
[ OK ] armv8.neon_x4 (mean 58.356ms, confidence interval +- 0.076089%)
[==========] 3 benchmarks ran.
[ PASSED ] 3 benchmarks.
We see that a ASIMD version 4 blocks at a time, while not that much of an improvement as in the x86-64, is still 27% faster.
The library exposes several architecture dependent SHA implementations. It is the caller responsibility to choose the right one. This can be done at runtime once at application launch. For x86_64 systems for example one can use cpuid.h, see here for an example on how to choose an implementation.
Most vectorized implementations exploit the fact that independent branches in the Merkle tree can be hashed in "parallel" within one CPU, to take advantage of this, Merkleization algorithms that loop over consecutive tree layers hashing two blocks at a time need to be updated to pass the entire layer, or all consecutive blocks. A naive example on how to accomplish this can be found in this document
Some examples benchmarks running the algorithms in this library vs prysm's current implementation are
goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12 160 7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12 15 74012328 ns/op
PASS
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4 68 26677965 ns/op
BenchmarkHashBalancePrysm-4 7 165434686 ns/op
PASS
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4 121 9711482 ns/op
BenchmarkHashBalancePrysm-4 10 103716714 ns/op
PASS
The library offers low-level bindings for Nim that can be installed using:
nimble install https://github.com/prysmaticlabs/hashtree/
or used in a package with:
requires "https://github.com/prysmaticlabs/hashtree/"
At the top directory you can run
$ make rust_bindings
To run tests:
$ make rust_tests
See the examples
directory for examples on how to use the library