Replace slow SHA-1 by xxHash for classpath hashes #371

This commit replaces `SHA-1` with `xxHash`, an extremely fast non-cryptographic hash algorithm that is used extensively in production systems all over the world (check the website: http://cyan4973.github.io/xxHash/). The reason why this new hashing function has been added is because `MessageDigest` and, concretely, `MixedAnalyzingCompiler.config` showed up in some profiles consuming too much cpu time: ``` sbt.internal.inc.MixedAnalyzingCompiler$.$anonfun$makeConfig$1(File) 1213ms ``` The culprit is `Stamper.forHash` that relies on `MessageDigest` to checksum the jars on the fly: ``` sbt.internal.inc.Stamper$.$anonfun$forHash$2(File) 1213ms -> FilterInputStream.java:107 java.security.DigestInputStream.read(byte[], int, int) 1146ms ``` However, it is not reasonable that for not such a big classpath (sbt's main module classpath) this takes 1213ms. This is totally exaggerated, and it's a price that it's paid every time we create a `MixedAnalyzingCompiler`, which is instantiated every time we invoke `compileIncrementally` in the `IncrementalCompilerImpl`. Basically, in every compile iteration. The benefit of this change is motivated by the following scenarios: 1. Users incrementally compile lots of times. Reducing the running time of every iteration makes the incremental compilation start faster. 2. Big projects can have **huge** classpaths which will make incremental compilation significantly slower. This is especially true when every jar is big. There are faster hashing functions than xxHash. However, given that xxHash is used in important projects and has been battle-tested, I've decided to use it for Zinc. Note that cryptographic hashes are not necessary to ensure that the classpath has or has not changed. It is safe to use `xxHash` since the collision rate is really low.

The new `Hash32` class is designed to wrap a 32-byte integer product of the underlying hash algorithm. Instead of making this accept a `Long` (xxHash works at the 64-byte level), we produce `Int` because `FileHash` needs an integer and we don't want to break that API. Truncating from 64-byte to 32-byte is safe, read http://fastcompression.blogspot.ch/2014/07/xxhash-wider-64-bits.html. The following commit hence removes any reference to the previous `Hash`, that was using hexadecimal values, and deprecates any method that assumes its existence `getHash` in the `Stamp` Java interface. It also makes the changes in the underlying formats, the binary change being binary compatible.

See discussion in sbt#371. Motivations to have a 64-byte hashes for all the file hashes: 1. In a distributed environment, 32-byte is not sufficient to ensure that there are no collisions. 2. The quality of the 64-byte hash is supposed to be greater than 32. 3. Previously, we were using the `hashCode` as a 32-byte hash for the hexadecimal representation of a SHA1 of some file contents. This is unsafe, and by switching to `long` we can also make the underlying hash file content implementation to be 64-byte based.

The following commit removes the streaming by mmapping the contents of the files to be hashed, which is faster and better because no copying occurs between OS regions and userbase regions, meaning that the OS optimizes its access in virtual memory. This also means that the subsequent times that this method is executed it will be faster since the memory will be already mapped. Mapping my whole ivy cache is 561ms the first time, and around ~350ms for the rest of the time.

This affects the `FileHash` and `Hash64` APIs. This will allow us to store the representation no matter what the hash is. It's friendlier to API changes in the future, as Stu has mentioned in the review. See https://github.com/sbt/zinc/pull/371/files for discussion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace slow SHA-1 by xxHash for classpath hashes #371

Replace slow SHA-1 by xxHash for classpath hashes #371

Commits on Jul 18, 2017

Commits on Jul 19, 2017