Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Class dependencies #86

Merged
merged 91 commits into from
Mar 30, 2016
Merged

Class dependencies #86

merged 91 commits into from
Mar 30, 2016

Conversation

gkossakowski
Copy link
Contributor

@gkossakowski gkossakowski commented Mar 19, 2016

This PR implements class-based dependency tracking. The basic idea is simple: let's track dependencies at class level instead of at source level. Dependencies on inner classes are tracked separately. Also, APIs (to detect which dependencies has changed) are tracked for each class separately.

The detailed list of what has been done and what's left to be done was tracked in sbt/sbt#1104. I'm pasting it here for convenience.


Tasks

Class-based dependency tracking can be implemented in two phases. The third phase is dedicated to testing on large projects.

Invalidate by inheritance based on class dependencies (phase 1)

Tasks marked as completed are completed only in a prototype implementation. No changes has been merged to sbt yet.

  • add tracking of declared classes in source files; we want to invalidate classes but at the end we need className -> srcFile relation because we can recompile source files only
  • add dependency tracking of top-level (not owned by any other class) imports; we'll assign dependencies introduced by those imports to the first top level class/object/trait declared in the source file (see notes below for details)
  • track dependencies at class level (and use declaredClasses to immediately map them back to files until the rest of the algorithm is implemented)
  • add tracking of children of sealed parents; we should track this in API object corresponding to sealed parent; this will enable us to perform proper invalidation when a new case (class) introduced to a sealed hierarchy
  • add per (top level) class tracking of api hashes; tracking of api hashes per each (even nested) class separately will be done in phase 2
  • handle dependencies coming from local and anonymous classes (see discussion below)
  • switch invalidation of dependencies by inheritance to invalidate at class level instead of at source level (would fix #2320)
  • distinguish between source and binary class names. Introduce a new relation binaryClassName: Relation[String, String] and use it whenever the conversion is needed.

Track name hashes per class (phase 2)

  • refactor tracking of APIs to be done at class level instead of at source level
  • extract APIs of all classes (including inner ones) separately
  • track name hashes at class level instead of being at source level (see also: #2319)
  • implement the rest of invalidation logic (member reference, name hashing) based on class names (get rid of mapping through declaredClasses in almost entire algorithm) (would fix #2320)

Only once the last bullet is merged we'll see improvement to incremental compilation when big number of nested classes is involved (e.g. like in Scala compiler's case).

Testing, bug fixing and benchmarking (phase 3)

  • test with the Scalatest repo
  • test with the scala repo using the sbt build
  • test with the specs2 repo
  • fix handling of anonymous and local classes defined in Java (similar logic will have to be implemented as for handling local Scala classes)
  • index classes only declared (not inherited) in source files
  • benchmark ;clean;compile performance
  • simplify classpath and Analysis instance lookup in incremental compiler (I think number of classpath lookups can be reduced now)

Merge changes upstrem, prepare for shipping (phase 4)

  • determine the location where this work will be merged into

Benefits

The improved dependency tracking delivers up to 40x speedups of incremental compilation in scenarios tested. Check benchmarking results here: sbt/sbt#1104 (comment)

The speedups are caused by fixing two main issues:

  1. The #2319 would be fixed once name hashes are tracked per class. This way introduction of a new class and members coming with it would not affect source files dependent (by member ref) on already existing classes.
  2. Effects of adding members (like methods) to a class would affect only classes that inherit from that class. At the moment, adding a member to a class that nobody inherits from can trigger invalidation of all descendants of all classes defined in the same source file (see #2320 for a specific scenario).

The 1. is likely to be triggered by any code base that uses more than one class defined in a single source file. The 2. is affecting code bases with big number of nested classes that are inherited.
One example is Scala compiler itself. Even with name hashing, we invalidate too much and code edit cycle becomes long whenever a new member is introduced.


Known issues

Backwards compatibility

This PR affects backwards compatibility in several ways:

  • support for Scala 2.7, 2.8 and 2.9 has been dropped
  • support for Analysis.groupBy operation has been dropped (it's not clear who needs it)
  • the API of Analysis object is very different now because most methods take classNames as arguments instead of file names

The inferred type is not great so let's enforce the right existential
explicitly.
Remove implementation of Relations that supports the old incremental
compilation algorithm. This will simplify changing relations to be
class-based.
Since introduction of TextAnalysisFormat, most of AnalysisFormat is
dead code. Some bits seem to be still used in various parts of sbt, though.

Updated TextAnalysisFormatSpecification to not use deprecated methods.
Introduce a new relation declaredClasses in Relations. It tracks names
of classes declared in given source file. Names of classes are fully
qualified names as seen at pickler phase (i.e. before flattening).
Objects are represented as object name with dollar sign appended to it.

The `declaredClasses` relation will be needed to map invalidated classes
back to source files where they are defined. It's worth mentioning that
there's a relation called `classes` defined already. However, that relation
tracks names of classes as seen by the Scala compiler backend. These are
names that will later appear in bytecode. For our purposes, we want to
have names of classes as they are declared in source code.

Declared classes are tracked properly for both Scala and Java files.

While updating ScalaCompilerForUnitTesting I flipped the nameHashing flag
to be true by default. That's in sync with sbt's default value for that
flag now.
Remove AnalysisTest that tests functionality not supported by name hashing.
We do not support it in a class-based dependency tracking.
Remove tests that are specific to the old incremental compiler and are
failing now.
Everything compiles but tests are failing. This commit should be split into two parts:
  - refactoring of Relations, TextAnalysisFormat, etc. that introduces support for relations where first element in their
     pair is not a file. At the moment, there're many places where code assumes that all relations are Relation[File, T].
   - the rest of the code
The helper method in IncrementalNameHashing was using a wrong
relation and had bugs in logic emitting fatal error messages.
Dependencies coming from classes defined locally are not tracked properly
and proper accounting of such dependencies will require refactoring
of the Dependency phase. We'll need to make an explicit decision where
dependencies of locally defined classes should go. They cannot be tracked
at the level of a class that introduces the dependency because we track
only classes publicly visible from other compilation units (this is
left to be defined precisely later on).

As for now, we just mark a test in DependencySpecification as pending.
Refinements are represented as classes internally but we want to
record dependencies on named classes only. Therefore, we ignore
the refinement class and only look at symbols referred from the
refinement.
Introduce a dedicated ClassToSourceMapper and use it throughout the
incremental compiler. The ClassToSourceMapper takes care of properly
mapping classes that have been moved between source files by looking
at relations before recompilation and after recompilation.
We used to use scala.Symbol to easily identify sources in file system
independent manner. Since we switched to class-based dependency tracking,
we can use class names directly.
Object's members are declared in module class but references to objects
in trees are typechecked with module symbol assigned to it. We handle
references to objects by mapping module symbols to module class symbols.
We would forget to consider declared classes for mapping. This would cause
invalid mapping in case of mixed compilation (of java and scala files).
This commit fixes invalidation caused by api changes to an external
(defined in other project) class. The problem was that we tried to
convert internal class-> external class dependency relation to
src->src but we would end up with an empty relation because we can't map
external class to a src (it's defined outside of current project).

We fix this problem by converting relation to internal src -> external
class. This conversion is always possible.

We also fix a bug in ClassToSourceMapper where conversion would lose some
relations in case the left side of relation (class name) mapped to
several different source files. This is possible when class is moved
between source files.
The `classes` relation does the same thing as `declaredClasses` relation
except it uses different string representation for class names. The
`classes` uses names as seen after flatten phase but `declaredClasses`
looks at symbols at pickler phase.

I think the choice doesn't matter as long as the same naming scheme is
used consisstently everywhere. Let's give flattened class names a try
and find out if we can remove `declaredClasses`.
Modify xsbti.api.Source to carry api hash separately for each
top level definition. Modify api extraction phase to compute hashes for
each top-level definition separately.

This will enable us to invalidate individual top-level classes.
This commit implements invalidation of dependencies at class level both
for inheritance and memberRef dependencies. However, information that
determines api changes is still tracked mostly at source file level.

Name hashes are tracked at source level (classes considered for
invalidation are mapped back to source files before name hash lookup).
For that reason, #2319 is still not fixed.

Api hashes are tracked per-class but only for top-level classes. It's not
a final design but a quick hack to prove that class-based invalidation is
working correctly. It was enough to scenario described in #2320 fixed.
This is shown in `class-based-inheritance` scripted test.
The final design of api hashing will be that all publicly accessible
classes will get their own separate hash.

The sealed test is marked pending as described in notes of #1104.

Other thing to note is that we're getting close to being able to strip
abstractions that deal with the fact that dependencies were tracked either
as Relation[File, File] (for internal dependencies) or
Relation[File, String] (for external dependencies). We track all
dependencies uniformly as Relation[String, String] now.
We tried to switch to src class names everywhere in the incremental
compiler but that broke the logic that was looking up generated class
files. As a result, we wouldn't record inner classes properly due to
difference between src and flat class names.

I've added a scripted test that verifies proper recording of inner classes.
This is a large commit so I'll describe high level changes first and then
go through each individual source file.

At the center of this commit is differentation between binary and source
class name. The incremental compiler needs to track both. The primary way
of tracking classes and referring to them is through source (canonical
in Java's terminology) class names. Binary names are needed for classpath
lookups and turning binary dependencies into source dependencies. We
perform that process each time there's a dependency on a class that has
been compiled in different compiler run and from current compiler run is
seen as a binary dependency (dependency on a class file).

We maintain mapping between source and binary class names in
`Relations.classes` relation. We track in that relation only names of
non-local classes. Non-local casses are the ones that can be referred
from other source files. The top-level, public inner, package private,
protected are all examples of non-local classes. Locals classes are ones
defined within a scope not owned by a class (e.g. in a block or a method)
or anonymous classes. We maintain names of non-local classes because only
those classes have canonical names and only those can be dependencies of
other classes (hence we might need to perform mapping between binary and
source name).

Most changes in this commit are driven by changes to AnalysisCallback,
the Java interface. I changed indentation from tabs to spaces in that file
so the diff is hard to read. Let me explain the most important changes:

  - classDependency and binaryDependency methods have clear names of
    parameters that indicate whether they expects binary or source class
    names
  - generatedClass has been split into two: generatedNonLocalClass and
    generatedLocalClass. The first one takes both forms of the name as
    parameters, whereas the second variant takes just source file and
    class file and records just that. We used to pass some garbage names
    for local classes before. Now we make a clear distinction in
    AnalysisCallback between local and non-local classes and amount of
    information that is tracked about both. This change reduces the size
    of Analysis object and its serialized form. The generatedNonLocalClass
    method overlaps with declaredClass method. I haven't decided what to
    do about it yet but I think declaredClasses can be merged into
    generatedNonLocalClass

Change to `generatedClass` caused a little bit different tracking of
compilation products (class files). We do not track a binary name for
every class file produced; we track it only for class files corresponding
to non-local classes. For that reason we split products into products
and classes in Analysis and Relations.

All changes to Compile.scala (AnalysisCallback implementation) are about
changes to how names and classes are tacked. They are all stragithforward
changes.

Analyzer makes it more clear what kind of names it's tracking and passing
to AnalysisCallback.

Analyze, the class that extracts dependencies from class files compiled
by Java compiler has been adapted to changes in AnalysisCallback and
clarifies when binary and source class names are used. I added a test,
AnalyzeSpecification, that checks whether class dependencies on and from
inner classes are extracted properly.

I also added a scripted test that tests the same thing but verifies
information recorded in Analysis. The reason for a duplicated test is that
scripted test checks a whole lot more:

  - it verifies a correct behavior of AnalysisCallback implementation
  - it verifies that adding class dependencies is routed properly in
    Analysis data structures

Scripted test is an integration test whereas AnalyzeSpecification is a unit
test. In order to implement AnalyzeSpecification I had to introduce
a class that lets me compile Java source on a fly:
JavaCompilerForUnitTesting.

The common functionality between java and scala compiler for unit testing
that extracts data from TestCallback has been pushed to its companion
object. For that to work, I had to add dependency on interfaceProj's test
scope from classfileProj in build.sbt.
Introduce ClassName trait that centralizes logic for creating both source
and binary names for a class symbol. In addition to using it in Analyzer,
Dependency, ExtractDeclaredClasses we also use it in ExtractAPI. We want
every component of the incremental compiler to use consitent naming scheme.

Add ClassNameSpecification that documents expected behavior of ClassName.
The `binaryClassName` relation maintains mapping between source and binary
class names. This mapping is needed to map binary dependencies back to
source dependencies in case of separate compilation (where we see
dependencies on class files). You can see that mapping being used in
`binaryDependency` method implementation of Analysis callback. Previously,
we would map class file to a source file it was produced from and then
assume that dependency is on any (all) of classes declared in that class.
Introduction of `binaryClassName` lets us map dependency back to source
class name directly and remove that imprecision of dependency tracking.

We maintain mapping between source and binary class names just for
non-local classes. Check this
sbt/sbt#1104 (comment) for the
discussion of local and non-local classes.

We also rework tracking of products in Analysis by introducing explicitly
the concept of local and non-local products corresponding to local and
non-local classes. This helps us to clarify for which classes we track
source and binary class names.
Otherwise, scalariform reformats this file on every `compile`
invocation.
Implements a strategy for recording dependencies introduced by top level
imports by assigning those dependencies to the first top level class.
In case there are top level imports but no top level class/trait/object
defined in a compilation unit, a warning is issued. The rationale for this
strategy can be found at:
sbt/sbt#1104 (comment)

Add an unit test covering different cases of top level imports (e.g.
defined in nested packages).

Mark the scripted test source-dependencies/import-class as passing after
a small modification of adding a top level class.
Dependencies introduced by local classes require special handling because
we track only non local classes (that can be referred from other files)
in dependency relations. To overcome this problem, dependencies introduced
by a local class are recorded as introduced by an outer class that is
non local. However, this introduces a new problem with dependencies
introduced by inheritance. We don't want local inheritance dependencies
to cause a transitive invalidation of all classes that inherit from the
outer class containing the local class. Check the comment in
Relations.scala this patches introduces or follow the discussion of this
problem at: sbt/sbt#1104 (comment)

To capture the subtlety of inheritance dependencies from local classes,
we introduce `LocalDependencyByInheritance` case to `DependencyContext`
enum.

TestCallback has been modified to return extracted local inheritance
dependencies and a test in DependencySpecification has been updated
accordingly.

The Dependency phase has been reworked to handle local classes properly
by mapping dependencies to outer, non local classes.Check the
implementation for details of the mapping. It's worth mentioning that
mapping is implemented as an amortized O(1) lookup so this change
doesn't affect performance of extraction phase negatively.

The invalidation logic has been modified to take into account inheritance
dependencies introduced by local classes. The patch is small because
the invalidation logic is very straightforward: we invalidate local
inheritance dependencies non-transitively and we do not apply name hashing
dependency pruning.

Lastly, we mark local-class-inheritance scripted test as passing.
@gkossakowski
Copy link
Contributor Author

I added a small refactoring of the code that performs classpath and Analysis lookups. My motivation was streamline bits of code that are passing entry and forEntry functions.

I updated ExtractAPI per @smarter's comment.

Is this good to go for you?

@smarter
Copy link
Contributor

smarter commented Mar 29, 2016

LGTM in the limited testing I did, however I'm still a bit concerned by the performance regressions on scala/scala (sbt/sbt#1104 (comment)), would be good to know why this happens and if this happens in other big projects too.

@smarter
Copy link
Contributor

smarter commented Mar 29, 2016

Also the tests seem to be failing, no idea why

@Duhemm
Copy link
Contributor

Duhemm commented Mar 30, 2016

LGTM!

@gkossakowski
Copy link
Contributor Author

The test failure was a Travis glitch.

@eed3si9n eed3si9n merged commit c1f3c04 into 1.0 Mar 30, 2016
@eed3si9n eed3si9n deleted the class-dependencies branch March 30, 2016 20:26
@eed3si9n
Copy link
Member

Merged!

@gkossakowski
Copy link
Contributor Author

Excellent news! I'll work on remaining issues. Just as a convenience, here they are:

// to types but that might be a bad thing because it might expand aliases eagerly which
// not what we need
case t: TypeTree if t.original != null =>
t.original.foreach(traverse)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this line (and the one above for macro originals) is a performance regression. Both foreach and traverse will recurse into the tree. See #187

Either t.original.foreach(handleTreeNode) or simply traverse(t.original) seems to be the right approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, fixed in #193.

Duhemm added a commit to Duhemm/dotty that referenced this pull request Oct 29, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Oct 30, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 6, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 6, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 6, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 7, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 7, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 7, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 7, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 7, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 7, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 7, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
Duhemm added a commit to Duhemm/dotty that referenced this pull request Nov 13, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
allanrenucci pushed a commit to dotty-staging/dotty that referenced this pull request Dec 21, 2017
Reverted while implementing class-based NameHashing. See:
 - sbt/zinc@189a2a5
 - sbt/zinc#86
retronym added a commit to retronym/scala that referenced this pull request Apr 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants