Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace slow SHA-1 by xxHash for classpath hashes #371

Closed
wants to merge 5 commits into from

Commits on Jul 18, 2017

  1. Replace slow SHA-1 by xxHash for classpath hashes

    This commit replaces `SHA-1` with `xxHash`, an extremely fast
    non-cryptographic hash algorithm that is used extensively in production
    systems all over the world (check the website:
    http://cyan4973.github.io/xxHash/).
    
    The reason why this new hashing function has been added is because
    `MessageDigest` and, concretely, `MixedAnalyzingCompiler.config` showed
    up in some profiles consuming too much cpu time:
    
    ```
    sbt.internal.inc.MixedAnalyzingCompiler$.$anonfun$makeConfig$1(File) 1213ms
    ```
    
    The culprit is `Stamper.forHash` that relies on `MessageDigest` to
    checksum the jars on the fly:
    
    ```
    sbt.internal.inc.Stamper$.$anonfun$forHash$2(File) 1213ms
    -> FilterInputStream.java:107 java.security.DigestInputStream.read(byte[], int, int) 1146ms
    ```
    
    However, it is not reasonable that for not such a big classpath (sbt's
    main module classpath) this takes 1213ms. This is totally exaggerated,
    and it's a price that it's paid every time we create a
    `MixedAnalyzingCompiler`, which is instantiated every time we invoke
    `compileIncrementally` in the `IncrementalCompilerImpl`. Basically,
    in every compile iteration.
    
    The benefit of this change is motivated by the following scenarios:
    
    1. Users incrementally compile lots of times. Reducing the running time
    of every iteration makes the incremental compilation start faster.
    
    2. Big projects can have **huge** classpaths which will make incremental
    compilation significantly slower. This is especially true when every jar
    is big.
    
    There are faster hashing functions than xxHash. However, given that
    xxHash is used in important projects and has been battle-tested, I've
    decided to use it for Zinc.
    
    Note that cryptographic hashes are not necessary to ensure that the
    classpath has or has not changed. It is safe to use `xxHash` since the
    collision rate is really low.
    jvican committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    61926e8 View commit details
    Browse the repository at this point in the history
  2. Replace old Hash by Hash32

    The new `Hash32` class is designed to wrap a 32-byte integer product of
    the underlying hash algorithm. Instead of making this accept a `Long`
    (xxHash works at the 64-byte level), we produce `Int` because `FileHash`
    needs an integer and we don't want to break that API.
    
    Truncating from 64-byte to 32-byte is safe, read
    http://fastcompression.blogspot.ch/2014/07/xxhash-wider-64-bits.html.
    
    The following commit hence removes any reference to the previous `Hash`,
    that was using hexadecimal values, and deprecates any method that
    assumes its existence `getHash` in the `Stamp` Java interface.
    
    It also makes the changes in the underlying formats, the binary change
    being binary compatible.
    jvican committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    56d38b4 View commit details
    Browse the repository at this point in the history
  3. Migrate away from 32-byte hashes

    See discussion in sbt#371.
    
    Motivations to have a 64-byte hashes for all the file hashes:
    
    1. In a distributed environment, 32-byte is not sufficient to ensure
       that there are no collisions.
    2. The quality of the 64-byte hash is supposed to be greater than 32.
    3. Previously, we were using the `hashCode` as a 32-byte hash for the
       hexadecimal representation of a SHA1 of some file contents. This is
       unsafe, and by switching to `long` we can also make the underlying
       hash file content implementation to be 64-byte based.
    jvican committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    b466a40 View commit details
    Browse the repository at this point in the history

Commits on Jul 19, 2017

  1. Mmap jars and sources to be hashed

    The following commit removes the streaming by mmapping the contents of
    the files to be hashed, which is faster and better because no copying
    occurs between OS regions and userbase regions, meaning that the OS
    optimizes its access in virtual memory.
    
    This also means that the subsequent times that this method is executed
    it will be faster since the memory will be already mapped. Mapping my
    whole ivy cache is 561ms the first time, and around ~350ms for the rest
    of the time.
    jvican committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    733d0fd View commit details
    Browse the repository at this point in the history
  2. Turn hash type to byte[] in all APIs

    This affects the `FileHash` and `Hash64` APIs.
    
    This will allow us to store the representation no matter what the hash
    is. It's friendlier to API changes in the future, as Stu has mentioned
    in the review.
    
    See https://github.com/sbt/zinc/pull/371/files for discussion.
    jvican committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    d7e85a1 View commit details
    Browse the repository at this point in the history