Replace slow SHA-1 by xxHash for classpath hashes #371
Closed
Commits on Jul 18, 2017
Replace slow SHA-1 by xxHash for classpath hashes
This commit replaces `SHA-1` with `xxHash`, an extremely fast non-cryptographic hash algorithm that is used extensively in production systems all over the world (see the website: http://cyan4973.github.io/xxHash/). This new hashing function was added because `MessageDigest` and, concretely, `MixedAnalyzingCompiler.config` showed up in profiles consuming too much CPU time:

```
sbt.internal.inc.MixedAnalyzingCompiler$.$anonfun$makeConfig$1(File) 1213ms
```

The culprit is `Stamper.forHash`, which relies on `MessageDigest` to checksum the jars on the fly:

```
sbt.internal.inc.Stamper$.$anonfun$forHash$2(File) 1213ms
-> FilterInputStream.java:107
java.security.DigestInputStream.read(byte[], int, int) 1146ms
```

Taking 1213ms for a classpath that is not even that big (sbt's main module classpath) is excessive, and it is a price paid every time we create a `MixedAnalyzingCompiler`, which is instantiated every time we invoke `compileIncrementally` in `IncrementalCompilerImpl`; basically, in every compile iteration.

The benefit of this change is motivated by the following scenarios:

1. Users incrementally compile lots of times. Reducing the running time of every iteration makes incremental compilation start faster.
2. Big projects can have **huge** classpaths, which make incremental compilation significantly slower. This is especially true when every jar is big.

There are faster hashing functions than xxHash. However, given that xxHash is used in important projects and has been battle-tested, I've decided to use it for Zinc. Note that cryptographic hashes are not necessary to ensure that the classpath has or has not changed; it is safe to use `xxHash` since the collision rate is really low.
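The streaming `MessageDigest` path that the profile above points at can be sketched as follows. `Sha1Stamp` and its method name are illustrative, not Zinc's actual `Stamper` code; the point is that every byte of every jar flows through `DigestInputStream.read`, which is where the CPU time went:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

// Minimal sketch of the old SHA-1 stamping strategy: stream the whole file
// through a DigestInputStream so the digest is updated as bytes are read.
public class Sha1Stamp {
    public static byte[] sha1(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) >= 0) {
                // reading drives the digest; nothing else to do here
            }
        }
        return md.digest(); // 20-byte SHA-1 checksum
    }
}
```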
Commit: 61926e8
The new `Hash32` class is designed to wrap the 32-bit integer produced by the underlying hash algorithm. Instead of making it accept a `Long` (xxHash works at the 64-bit level), we produce an `Int` because `FileHash` needs an integer and we don't want to break that API. Truncating from 64 bits to 32 bits is safe; see http://fastcompression.blogspot.ch/2014/07/xxhash-wider-64-bits.html. This commit hence removes any reference to the previous `Hash`, which used hexadecimal values, and deprecates the method that assumed its existence, `getHash`, in the `Stamp` Java interface. It also makes the corresponding changes in the underlying formats, keeping the format change binary compatible.
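The truncation described above amounts to keeping the low 32 bits of the 64-bit xxHash value. A minimal sketch, with an illustrative class and method name (the commit's wrapper is called `Hash32`, but `truncate` is an assumption):

```java
// Hypothetical sketch of 64-bit-to-32-bit truncation: for a well-mixed
// hash like xxHash, any 32-bit slice is itself a good 32-bit hash, so a
// plain narrowing cast (which keeps the low 32 bits) is safe.
public class Hash32Sketch {
    public static int truncate(long hash64) {
        return (int) hash64; // keeps the low 32 bits
    }
}
```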
Commit: 56d38b4
Migrate away from 32-bit hashes
See discussion in sbt#371. Motivations to have 64-bit hashes for all the file hashes:

1. In a distributed environment, 32 bits are not sufficient to ensure that there are no collisions.
2. The quality of the 64-bit hash is supposed to be greater than that of the 32-bit one.
3. Previously, we were using `hashCode` as a 32-bit hash of the hexadecimal representation of a SHA-1 of some file contents. This is unsafe, and by switching to `long` we can also make the underlying file-content hashing implementation 64-bit based.
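A back-of-the-envelope birthday-bound calculation supports point 1. For a k-bit hash over n files, the collision probability is roughly n² / 2^(k+1); the numbers below are illustrative, not taken from the PR:

```java
// Birthday-bound approximation: with n items and a k-bit hash, the chance
// of at least one collision is about n^2 / 2^(k+1). For ~100,000 hashed
// files, 32 bits are effectively guaranteed to collide, while 64 bits
// keep the probability negligible.
public class CollisionBound {
    public static double approxCollisionProbability(long n, int bits) {
        return (double) n * n / Math.pow(2.0, bits + 1);
    }
}
```

For n = 100,000 this gives roughly 1.16 for 32 bits (i.e. a collision is near-certain; the bound exceeds 1) versus about 2.7e-10 for 64 bits.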
Commit: b466a40
Commits on Jul 19, 2017
Mmap jars and sources to be hashed
The following commit replaces streaming with memory-mapping (mmapping) the contents of the files to be hashed, which is faster because no copying occurs between kernel space and user space, meaning the OS can optimize access through virtual memory. It also means that subsequent executions of this method will be faster, since the memory will already be mapped. Hashing my whole Ivy cache takes 561ms the first time, and around ~350ms afterwards.
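The mmap approach can be sketched with `FileChannel.map`; the class name is illustrative and the toy polynomial hash stands in for xxHash, which the real code uses:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of hashing via mmap (names are illustrative, not Zinc's API):
// map the file into memory and feed the buffer to the hash in one pass,
// avoiding the copy from kernel space into a user-space read buffer.
public class MmapHash {
    public static long hash(Path file) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long h = 1125899906842597L; // toy polynomial hash, standing in for xxHash
            while (buf.hasRemaining()) {
                h = 31 * h + buf.get();
            }
            return h;
        }
    }
}
```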
Commit: 733d0fd
Turn hash type to `byte[]` in all APIs
This affects the `FileHash` and `Hash64` APIs. It allows us to store the representation no matter which hash produced it, and it is friendlier to future API changes, as Stu mentioned in the review. See https://github.com/sbt/zinc/pull/371/files for discussion.
Commit: d7e85a1