
Track added products for cache invalidation #569

Closed
wants to merge 2 commits

Conversation

@eatkins (Contributor) commented Aug 9, 2018

I discovered that in sbt, the scripted `tests/internal-tracking` test would fail if run with `-Dsbt.resident.limit=$NUM` where `$NUM > 0`. This was because the test introduced a dependent project b that depended on a, but was configured not to track a. If the b project was compiled before the a project, the compilation would fail, but a cache entry would be added to the compiler cache with a key that included a's target directory in its classpath. If a was then compiled independently, the expectation was that b would also compile, but it still failed. The reason was that when zinc looked for the cached compiler, it found a hit and did not create a new compiler instance. This was a problem because the previous instance was unaware of the newly generated classes from the a project. To get b to compile, it is necessary to invalidate the compiler cache entry.

To invalidate the entry, we must detect that there are changes to the project's dependencies. Previously, zinc did not even look at new classes in a dependency's target directory. It did look at `removedProducts`, but did not actually consider the removed products when constructing the `DependencyChanges` instance in Incremental.scala. This commit adds an `addedProducts` field to `InitialChanges`, with alternate constructors and apply methods for binary compatibility. I then update Incremental.scala to append both the removed and added products to the `DependencyChanges.modifiedBinaries` parameter. After those two changes, the scripted test passes with or without cached compilers. Note that this behavior was never detected in the past because all of the scripted tests run without setting `sbt.resident.limit`, so every single run of compile is guaranteed to get a fresh compiler.

  • I could have left `InitialChanges` as it was and just appended the new products to the `removedProducts` set, but that seemed liable to lead to confusion.
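
As a rough illustration of where the added products feed into the binary dependency changes, here is a minimal, self-contained sketch. The names mirror the ones mentioned above (`InitialChanges`, `DependencyChanges`, `modifiedBinaries`), but the definitions are simplified stand-ins, not zinc's actual sources:

```scala
import java.io.File

// Hedged sketch, not the real zinc types: simplified stand-ins for
// InitialChanges and DependencyChanges to show where added products
// get folded into the binary dependency changes.
final case class InitialChanges(
    removedProducts: Set[File],
    addedProducts: Set[File] // the field this PR proposes to add
)

trait DependencyChanges {
  def modifiedBinaries: Array[File]
  def modifiedClasses: Array[String]
  def isEmpty: Boolean = modifiedBinaries.isEmpty && modifiedClasses.isEmpty
}

object InvalidationSketch {
  // Previously only the removed products reached modifiedBinaries; appending
  // the added products as well lets the incremental compiler notice that a
  // dependency's target directory gained new class files.
  def dependencyChanges(initial: InitialChanges): DependencyChanges =
    new DependencyChanges {
      val modifiedBinaries: Array[File] =
        (initial.removedProducts ++ initial.addedProducts).toArray
      val modifiedClasses: Array[String] = Array.empty
    }
}
```

The point is simply that both removed and added products end up in `modifiedBinaries`, which is what lets zinc detect that a dependency changed even when a cached compiler instance is reused.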

@typesafe-tools

The validator has checked the following projects against Scala 2.12, tested using dbuild, with the projects built on top of each other.

| Project | Reference | Commit |
| --- | --- | --- |
| sbt | 1.2.x | sbt/sbt@549c694 |
| zinc | pull/569/head | 4761a63 |
| io | 1.2.x | sbt/io@fb0acb1 |
| librarymanagement | 1.2.x | sbt/librarymanagement@f42ec6f |
| util | 1.2.x | sbt/util@6a9c7c9 |
| website | 1.2.x | |

❌ The result is: FAILED

@jvican (Member) commented Aug 10, 2018

@eatkins Hey Ethan, resident compilation has not been working for more than three years (it started as an experiment and was then dropped). Since then, Zinc 1.0 hasn't been tweaked to work with resident compilation, and I would prefer not to tweak it until there's a working and reliable resident compilation in the scalac compiler (which is not the case today).

@eatkins (Contributor, Author) commented Aug 10, 2018

@jvican that's interesting. I have actually been using it for the last 6 months without much in the way of noticeable issues. What I have noticed is that it often takes a few hundred milliseconds to create a new compiler instance so using a cached compiler is a noticeable improvement when it works. Do you remember off hand what doesn't work when using the cached compiler? Also, if cached compilers are unsupported, why are they not deprecated both in zinc and in sbt?

To resolve the cache issue, I just append the removed and new products to the binary dependency changes. After this change, `scripted project/internal-tracking` passes even when a global compiler cache is used. The test I added to MultiProjectIncrementalSpec also fails without this change and passes with it.

@eatkins (Contributor, Author) commented Aug 10, 2018

For context, I have been trying to reduce the time between triggering a build and the start of compilation in sbt. When you optimize the other initialization steps (gathering source files and comparing stamps), creating a new compiler instance ends up taking most of the time. I have been able to get continuous builds using `~compile` for small projects to complete within 100ms of touching a simple file, e.g. `class Foo`. That number increases by 50ms-300ms depending on how warm the JVM is when a fresh compiler instance is created for the run. I suppose it doesn't affect UX that much, but I'm personally of the mindset that every millisecond matters for software latency, especially when it creeps up to the 200ms range, which is noticeable.

@jvican (Member) commented Aug 10, 2018

> What I have noticed is that it often takes a few hundred milliseconds to create a new compiler instance so using a cached compiler is a noticeable improvement when it works.

The initialization time of the compiler is negligible when the JVM is hot. I was not the guy doing experiments with resident compilation, but from what I know it was deemed not worth it compared to the speedups it gave. Down the road, I'd love to experiment with that, but until then I would prefer to keep things as they are in Zinc.

> For context, I have been trying to reduce the time between triggering a build and the start of compilation in sbt. When you optimize the other initialization steps (gathering source files and comparing stamps), creating a new compiler instance ends up taking most of the time. I have been able to get continuous builds using `~compile` for small projects to complete within 100ms of touching a simple file, e.g. `class Foo`. That number increases by 50ms-300ms depending on how warm the JVM is when a fresh compiler instance is created for the run. I suppose it doesn't affect UX that much, but I'm personally of the mindset that every millisecond matters for software latency, especially when it creeps up to the 200ms range, which is noticeable.

I agree with you we can do much better in this regard. However, I'd like to give you some data points that I've gotten from using Bloop, which I encourage you to try if you care about compilation performance (we've seen from 30% to 2x faster compilation times for big builds).

Below is a graph that measures the compilation of a small module defining a few tests with both sbt and bloop. Bloop compiles the files in 62 ms, 3x faster than sbt:

[screenshot: compile-time comparison between sbt and bloop]

It largely depends on your project, but this benchmark illustrates well that bloop does much better at compiling incrementally than sbt (incremental compilation only compiles a few source files at once in the most common case).

> That number increases by 50ms-300ms depending on how warm the JVM is when a fresh compiler instance is created for the run.

Now, coming back to resident compilation: in order to do resident compilation correctly, the compiler needs to be able to invalidate its symbol table whenever one of the classpath entries is modified. From what I know, no Scala compiler does that yet (although I've been doing work in this area that is bringing us closer to it), and therefore using resident compilation is not safe and is discouraged.
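
To make the invalidation requirement concrete, here is a hedged sketch of content-aware caching. The names (`ContentFingerprint`, `FingerprintedCompilerCache`) are hypothetical; this is neither scalac's symbol-table machinery nor zinc's actual compiler cache, just the general idea that an entry should only be reused when the classpath contents, not merely the paths, are unchanged:

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest
import scala.collection.concurrent.TrieMap

// Hypothetical names, not zinc's CompilerCache: a cache keyed by a hash of the
// classpath *contents*, so an entry cannot be reused after a dependency's
// target directory changes underneath it.
object ContentFingerprint {
  def of(entries: Seq[Path]): String = {
    val md = MessageDigest.getInstance("SHA-1")
    entries.foreach { entry =>
      md.update(entry.toString.getBytes("UTF-8"))
      if (Files.isRegularFile(entry)) md.update(Files.readAllBytes(entry))
      // a real implementation would also walk directories and hash the class files inside
    }
    md.digest().map("%02x".format(_)).mkString
  }
}

final class FingerprintedCompilerCache[Compiler](newCompiler: () => Compiler) {
  private val cache = TrieMap.empty[String, Compiler]
  // A hit is only valid when the classpath contents hash to the same value;
  // any change produces a new fingerprint and therefore a fresh compiler.
  def get(classpath: Seq[Path]): Compiler =
    cache.getOrElseUpdate(ContentFingerprint.of(classpath), newCompiler())
}
```

Keying on contents rather than paths is the kind of check that would have prevented the stale cache hit described in this PR.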

It would be great, though, if you could give me numbers for using resident compilation with and without bloop so that we can see how much speedup it gives us compared to the state of the art. This would give us a lot of insight and, who knows, maybe it encourages us to prioritize it in the near future.

@eatkins (Contributor, Author) commented Aug 10, 2018

Thanks for the clarification! I have been working on integrating an optional file system cache into sbt that automatically updates itself and recalculates hashes when it detects source file changes. This is how I've been able to reduce the compilation initialization time in sbt. It will be interesting to see if my changes can improve the performance relative to bloop. If not, I think I understand the pipeline well enough to hopefully be able to integrate the bloop performance improvements into sbt. As an experiment, I also integrated the file system cache into mill and did some rudimentary performance tests; I actually found sbt to be slightly faster than mill for large projects, but that was a few months ago and I'm not sure whether the result will hold up.
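
As an illustration of the kind of cache being described here (hypothetical code, not sbt's, swoval's, or Bloop's actual API), here is a minimal hash cache that refreshes entries from file-watch events instead of re-hashing every source file on each build:

```scala
import java.nio.file.{FileSystems, Files, Path, StandardWatchEventKinds => K}
import java.security.MessageDigest
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: content hashes of files in a directory, refreshed from
// watch events rather than recomputed wholesale on every build.
final class SourceHashCache(dir: Path) {
  private val hashes  = TrieMap.empty[Path, String]
  private val watcher = FileSystems.getDefault.newWatchService()
  dir.register(watcher, K.ENTRY_CREATE, K.ENTRY_MODIFY, K.ENTRY_DELETE)

  private def sha1(p: Path): String =
    MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(p)).map("%02x".format(_)).mkString

  // Drain pending events and re-hash only the files that changed.
  def refresh(): Unit = {
    var key = watcher.poll()
    while (key != null) {
      key.pollEvents().forEach { event =>
        if (event.kind() != K.OVERFLOW) {
          val changed = dir.resolve(event.context().asInstanceOf[Path])
          if (Files.isRegularFile(changed)) hashes.update(changed, sha1(changed))
          else hashes.remove(changed)
        }
      }
      key.reset()
      key = watcher.poll()
    }
  }

  // Current hash, computed lazily the first time a file is asked about.
  def hash(p: Path): String = { refresh(); hashes.getOrElseUpdate(p, sha1(p)) }
}
```

A build could then ask `hash(path)` for each source file when comparing stamps and only pay the hashing cost for files that actually changed.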

@jvican (Member) commented Aug 10, 2018

> Thanks for the clarification! I have been working on integrating an optional file system cache into sbt that automatically updates itself and recalculates hashes when it detects source file changes.

We have something similar to this in Bloop, and there's still work to do to share all of these hashes across all of its users inside the bloop codebase. I registered this idea in this thread:

> One thing to consider is to associate hashes with classpath entries, and allow any compilation process to reuse those hashes. But if we do that, we need to think about proper invalidation. All options are...

So at some point I'll get back to it if nobody beats me to it.

@eatkins (Contributor, Author) commented Aug 10, 2018

Cool. It's good to see that my approach is in line with what you and others have considered for bloop. One pie-in-the-sky idea is to have the file system cache actually include the contents of the source files. If the compiler exposed an interface that took a list of source files along with their contents, then the compiler would not need to do any disk IO to compile the sources. I am not really sure how disk bound compilation is, so maybe it wouldn't matter. I was initially surprised to discover the extent to which sbt can be heavily disk IO bound, though.
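
A rough sketch of that pie-in-the-sky interface, with made-up names (neither zinc nor scalac exposes anything with these signatures, as far as this thread establishes):

```scala
// Made-up names; this is an illustration of the idea, not an existing API.
trait InMemorySource {
  def name: String          // e.g. "Foo.scala"
  def contents: Array[Char] // already-loaded file contents: no disk IO at compile time
}

final case class CachedSource(name: String, contents: Array[Char]) extends InMemorySource

trait CompilerFrontEnd {
  // A front end like this could be fed directly from the file system cache,
  // so the bytes read while hashing are reused for compilation.
  def compile(sources: Seq[InMemorySource], classpath: Seq[java.nio.file.Path]): Boolean
}
```

The file system cache would hand the compiler the contents it already read while hashing, so the sources never hit the disk a second time.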

@jvican (Member) commented Aug 10, 2018

> I am not really sure how disk bound compilation is, so maybe it wouldn't matter. I was initially surprised to discover the extent to which sbt can be heavily disk IO bound, though.

In the grand scheme of things, compilation is not disk intensive. However, to gain some speedups from IO I've recommended that several Bloop users just use tmpfs directories as target directories and rebase all the classpath entries of all the projects into a tmpfs directory. Tmpfs directories give you more throughput and less latency than reading/writing from/to disk, so I imagine that this alone would give you performance similar to what you can get from swoval (but obviously I haven't tested this).
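
For reference, the tmpfs suggestion could look something like the following in a project's build.sbt. This is an illustrative sketch: it assumes `/dev/shm` is a tmpfs mount on the machine running the build, and the `zinc-builds` subdirectory name is made up:

```scala
// build.sbt sketch: write compiled class files to a tmpfs-backed directory.
// /dev/shm and "zinc-builds" are illustrative choices, not a convention.
target := file("/dev/shm") / "zinc-builds" / name.value / "target"
```

Everything written there lives in memory and disappears on reboot; that trade-off is the point of the suggestion.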

Note that my approach in bloop does not aim to store the contents of the source files, but rather to cache just the "meta" information that I'm interested in and expose it to the different parts of the compiler APIs that need it. This meta information is things like hashes (real content hashes, not timestamps).

@eatkins (Contributor, Author) commented Nov 11, 2018

@jvican I remain a bit confused by your statement that resident compilation doesn't work in scalac. I have been using it in all of my projects for a year and have not had any issues with it. The only issue I've ever encountered was when I set resident compilation to be the default in sbt and ran the sbt scripted tests. Even then, there was only one test that failed; it was doing some unusual things with the project tracking configuration, and this PR fixes that test when run with resident compilation enabled.

@dwijnand (Member) commented

The resident compiler was removed (#835), so I guess this can be closed too.

@dwijnand closed this Jul 20, 2020