03-build-analytics.Rmd

# Build Analytics

## Motivation

### Introduction

Ideally, when building a project from source code to executable, the process should be 
fast and error-free. Unfortunately, this is not always the case and automated 
build systems notify developers of compile errors, missing dependencies, broken functionality 
and many other problems. This chapter is aimed to give an overview of the effort made in 
build analytics field and Continuous Integration (CI) as an increasingly common development 
practice in many projects. 

### Continuous Integration and Version Control System

Continuous Integration in a term used in software engineering to describe a practice of merging 
all developer working copies to a shared mainline several times a day. CI is in general used 
together with Version Control System (VCS), an application for revision control that ensures 
the management of changes to documents, source code and other collections of information.

<!-- The following paragraph is prone to errors when auto formatting. After a ..^[..] or ..^[..],
there _must_ be a space, else it is interpreted as superscript. --> 

### Build Analytics Definition

Build analytics covers research on data extracted from a build process inside a project. This
contains among others, build logs from Continuous Integration such as Travis&nbsp;CI^[See
https://travis-ci.org/], Circle&nbsp;CI^[See https://circleci.com/], Jenkins^[See https://jenkins.io/], 
AppVeyor^[See\ https://www.appveyor.com/] and TeamCity^[See https://www.jetbrains.com/teamcity/] or
surveys among developers about their usage of Continuous Integration or build systems. This
information is often paired with data from Version Control Systems such as Git. 

### Research Questions

We aimed to make a complete overview of build analytics field by analyzing both
state of the art and state of practice. We also inspect the future research that could be done
and finally conclude our survey with the research questions that emerged after this exhaustive
field research. To achieve a structured way of summarizing the field, we asked the
following research questions:

**RQ1**: What is the current state of the art in the field of build analytics?

In section \@ref(build-analytics-state-of-the-art) we present the current topics that are being
explored in the build analytics domain alongside the research methods, tools and datasets
acquired for the problems at hand and aggregate and reflect about the main research findings that
the state-of-the-art papers display.

**RQ2**: What is the current state of practice in the field of build analytics?

Section \@ref(build-analytics-state-of-practice) examines scientific papers to analyze the current
trend of build analytics in the software development industry. We look at the popularity of CI in
the industry and explore the increase in the use of CI by discussing its
ample benefits. Furthermore, we will discuss the practices used by engineers in the industry to
ensure that their code is improving and not decaying. 

**RQ3**: What future research can we expect in the field of build analytics?

In section \@ref(build-analytics-future-research) we will explore where new challenges lie in the
field of build analytics. We will also show what open research items are described in the papers.
This section ends with research questions based on the open research and challenges in current
research.

## Research Protocol {#build-analytics-research-protocol}

### Search Strategy

Taking advantage of the initial seed consisting of Bird and Zimmermann[@bird2017predicting], Beller
et al. [@beller2017oops], Rausch et al. [@rausch2017empirical], Beller et al. TravisTorrent
[@beller2017travistorrent], Pinto et al. [@pinto2018work], Zhao et al. [@zhao2017impact], Widder et
al. [@widder2018m] and Hilton et al. [@hilton2016usage], we used references to find new papers to
analyze. Moreover, we used academical search engine _Google Scholar_ to perform a keyword-based
search for other relevant build analytics papers. The keywords used were: build analytics,
machine learning, build time, prediction, continuous integration, build failures, active learning,
build errors, mining, software repositories, open-source software.

### Selection Criteria

In order to provide a valid current overview of the build analytics field, we selected only the
relevant papers that were published after 2008, in other words we have not included papers older
than 10 years. We had chosen 10 years as our threshold inspired by the "ICSE-Most Influential
Paper 10 Years Later" Award. The only paper that does not conform to this rule is the cornerstone
description of CI practices written by Martin Fowler, as we considered it important for us to see
the practices evolution in build analytics field. Most of the papers we founded were linked to our
research questions as references in the sections bellow. From the selected papers, we omitted two
papers, as they are small case studies on a couple of projects and do not introduce new techniques or
applications.

See table \@ref(tab:build-analytics-selected-papers) for an overview of the papers which were
selected for this survey.

## Answers

### Build Analytics State of the Art

**RQ1**: What is the current state of the art in the field of build analytics?

The current state-of-the-art in the build analytics domain refers to the use of machine learning techniques
to increase the productivity when using Continuous Integration (CI), to generate constraints on the 
configuration of the CI that could improve build success rate and to predict build failures even for 
newer projects with less training data available.

The papers identified using the research protocol defined in section
\@ref(build-analytics-research-protocol) that give us an overview of the current state of the art
in build analytics field are:

 * HireBuild: an automatic approach to history-driven repair of build scripts [@hassan2018hirebuild] 
 * A tale of CI build failures: An open source and a financial organization perspective [@vassallo2017tale]
 * (No) Influence of Continuous Integration on the Commit Activity in GitHub Projects [@baltes2018no]
 * Built to last or built too fast?: evaluating prediction models for build times [@bisong2017built]
 * Statically Verifying Continuous Integration Configurations [@santolucito2018statically]
 * ACONA: active online model adaptation for predicting continuous integration build failures [@ni2018acona]

The topics that are being explored are:

  * the importance of the build process in a VCS project in reference @hassan2018hirebuild
  * the impact factors of user satisfaction for using a CI tools in reference @widder2018m
  * methods from helping the developer to fix bugs in references @hassan2018hirebuild, @vassallo2018break
  * predicting build time in reference @bisong2017built
  * predicting build failures in references @santolucito2018statically, @ni2018acona

The tools that are being proposed are:

  * BART to help developers fix build errors by generating a summary of the failures
  with useful information, thus eliminating the need to browse error logs  [@vassallo2018break]
  * HireBuild to automatically fix build failures based on previous changes [@hassan2018hirebuild]
  * VeriCI capable of checking the errors in CI configurations files before the developer 
  pushes a commit and without needing to wait for the build result [@santolucito2018statically]
  * ACONA capable of predicting build failure in CI environment for newer projects with less 
  data available [@ni2018acona]

#### Importance of the Build Process and CI Users Satisfaction

The build process is an important part of a project that uses VCS -- in the sense that findings
by Hassan et al. [@hassan2018hirebuild] suggest that for such projects 22% of code commits include changes
in build script files for building purposes. Moreover, recent studies have focused on how satisfied the users of 
CI tools are. One paper by Widder et al. [@widder2018m] analyzed which factors have an impact 
on abandonment of Travis&nbsp;CI. This paper finds that increased build complexity reduces the chance of 
abandonment, but larger projects ared abandoned at a higher rate and that a project's language has significant
but varying effect. A surprising result is that metrics of configuration attempts and knowledge dispersion 
in the project do not affect its rate of abandonment.

#### Patent for Predicting Build Errors

In [@bird2017predicting], Bird et al. introduce a method for predicting software build errors. 
This US patent is owned by Microsoft. Having logistic regression as machine learning technique, 
the patent describes how to compute the probability of a build to fail. Using this method, build errors can be 
better anticipated, which decreases the time between working builds and speeds up development.

#### Predicting Build Time

Another important aspect is the impact of CI on the development process efficiency. One of the
papers that addresses this matter is written by Bisong et al. [@bisong2017built]. This paper 
aims to find a balance between the frequency of integration and developer's productivity by proposing 
machine learning models that were able to predict the build outcome. For this, they took advantage of the 56 features presented
in the TravisTorrent build records. Their models performed quite well with an R-Squared of around 80%, 
meaning that they were able to capture the variation of build time over multiple projects. Their research 
could be useful on one hand for software developers and project managers for a better time management scheme 
and on the other hand, for other researchers that may improve their proposed models. 

#### Predicting Build Failures

Moreover, usage of automation build tools introduces a delay in the development cycle generated by
the waiting time until the build finish successfully. One of the most recent analyzed papers by
Santolucito et al. [@santolucito2018statically] presents a tool VeriCI capable of checking the errors 
in CI configurations files before the developer pushes a commit and without needing to wait for the 
build result. This paper focuses on prediction of build failure without using metadata such as number of 
commits, code churn also in the learning process, but relying on the actual user programs and configuration 
scripts. This fact makes the identification of the error cause possible. VeriCI achieves 83% accuracy of 
predicting build failure on real data from GitHub projects and 30-48% of time the error justification provided
by the tool matched the actual error cause. These results seem promising, but there is a need to focus
more on producing the error justification fact that could make the use of machine learning tools in real
build analytics tools achievable and tolerated.

#### Prediction with Less Data Available

Even if there were considerable efforts in developing powerful and accurate machine learning models
for predicting the outcome of builds, most of these techniques cannot be trained properly without
extensive historical project data. The problem that resulted from this is that newer projects are unable to take 
advantage of the research conducted before and have to wait until enough data from their project
is generated to sufficiently train machine learning models for predicting the build outcome.
In reference @ni2018acona, the most recent paper of this survey which was published as a poster 
in June 2018, Ni et al. address the problem of build failure prediction in CI environment for newer projects 
with less available data. They are using already trained models from other projects with more data available and 
combined them by means of active learning to find which of these models generalized better 
from the problem at hand and to update the model's weights accordingly. They also aim to cut the expense 
that CI introduces by reducing the label data necessary for training. Even if the method seems promising, 
the results presented in the poster show an F-Measure (harmonic average of recall and precision) of around 
40% that one might are argue should be higher to be truly useful in practice.

### Build Analytics State of Practice

**RQ2**: What is the current state of practice in the field of build analytics?

Continuous Integration is a software engineering practice that requires developers to integrate code into a
shared repository several times a day. Each check-in is then verified by an automated build which allows 
engineers to detect bugs early.

An overview of Continuous Integration evolution from the introduction of the term to the current practices 
can be seen in the figure bellow:

![CI overview.](figures/build-analytics/state_pr.png)

The papers identified using the research protocol defined in section
\@ref(build-analytics-research-protocol) that give us an overview of the current state of the art
in build analytics domain are:

 * Usage, Costs, and Benefits of Continuous Integration in Open-Source Projects [@hilton2016usage]
 * An Empirical Analysis of Build Failures in the Continuous Integration Workflows of Java-Based Open-Source Software [@rausch2017empirical]
 * Continuous Integration [@fowler2006continuous]
 * Enabling Agile Testing Through Continuous Integration [@stolberg2009enabling]
 * TravisTorrent: Synthesizing Travis&nbsp;CI and Github for Full-Stack Research on Continuous Integration [@beller2017travistorrent]
 * I'm Leaving You, Travis: A Continuous Integration Breakup Story [@widder2018m]
 * Continuous integration in a social-coding world: Empirical evidence from GITHUB [@vasilescu2014continuous]
 

The topics that are being explored are:

  * Usage of CI in the industry by @hilton2016usage
  * Growing popularity of CI due to the introduction of VCS as suggested by @rausch2017empirical
  * Common practices used in the industry exemplified by @fowler2006continuous
  * Use of common CI practice in the agile approach presented by @stolberg2009enabling
  * Comparison between pull requests and direct commits to result in successful build as 
  uncovered by @vasilescu2014continuous
  
#### Build Analytics Usage 
  
A survey conducted in open-source projects by Hilton et al. [@hilton2016usage] indicated that 40% of all 
projects used CI. It observed that the average project introduces CI a year into development. Furthermore, 
the paper claims that CI is widely used in practice nowadays. One of many factors contributing to this 
is explored by Rausch et al. [@rausch2017empirical]. The growing popularity of Version Control Systems (VCS) 
such as Git, and hosting build automation platforms such as Travis have enabled any business of size to 
adopt the CI framework. As suggested by Hilton et al. [@hilton2016usage], the cost and time associated with
introducing the CI framework is not enormous and the copious benefits far outweigh the resources required.

#### Build Analytics Practices

The CI concept, often attributed to Martin Fowler [@fowler2006continuous], is recommended as best practice 
of agile software development methods such as extreme Programming [@stolberg2009enabling]. Fowler introduced many 
practices that are essential in maintaining the CI framework. Fowler and Foemmel [@fowler2006continuous] urge
engineers to keep all artifacts required to build the project in a single repository. This ensures that the 
system does not require additional dependencies. In addition, they advise to create a build script that can compile
the code, execute unit tests and automate integration. Once the code is built, all tests should run to confirm
that the built artifact behaves as the developer would expect it to behave. In this way, we are finding and eradicating software
bugs earlier and keeping builds fast. As explored by Widder et al. [@widder2018m], one of the factors that lead 
to companies abandoning the CI framework is the complexity of the build. A good practice is to have more
fast-executing tests than slow tests. 

Furthermore, builds should be readily available to stakeholders and testers as this can reduce the amount of rework 
required when rebuilding a feature that does not meet the requirements. In general, all companies should (at least) schedule a 
"nightly build" to update the project from the repository to ensure everyone is up to date. Continuous Integration 
is all about communication, so it is important to ensure that everyone can easily see the current state of the system. 
This is also another reason why CI works well in the agile industry [@stolberg2009enabling]. 
Both techniques stress the importance of good communication.

The paper by Vasilescu et al. [@vasilescu2014continuous] studies a sample of large and active
GitHub projects developed in Java, Python and Ruby. The paper finds that direct code modifications
(commits) are more popular than indirect code modifications (pull request). Additionally, the
notion of automated testing is not as widely practiced. Most samples in Vasilescu's
study [@vasilescu2014continuous] were configured to use Travis&nbsp;CI, however, less than half actually did use it.
In terms of languages, Ruby projects are among the early adopters of Travis&nbsp;CI, while Java
projects are late to adopt CI. The paper uncovers that the pull requests are much more likely to
result in successful builds than direct commits. 


### Build Analytics Future Research

**RQ3**: What future research can we expect in the field of build analytics?

Currently research on build analytics is limited by some challenges, some are specific to
build analytics and some are applicable to the entire field of software engineering.

The papers identified using the research protocol defined in section
\@ref(build-analytics-research-protocol) that give us an overview of challenges and future research
in the field of build analytics are:

 * Built to last or built too fast?: evaluating prediction models for build times [@bisong2017built]
 * Work Practices and Challenges in Continuous Integration: A Survey with Travis&nbsp;CI Users [@pinto2018work]
 * Statically Verifying Continuous Integration Configurations [@santolucito2018statically]
 * (No) Influence of Continuous Integration on the Commit Activity in GitHub Projects [@baltes2018no]
 * The impact of continuous integration on other software development practices: a large-scale empirical study [@zhao2017impact]
 * Un-Break My Build: Assisting Developers with Build Repair Hints [@vassallo2018break]
 * Oops, my tests broke the build: An explorative analysis of Travis&nbsp;CI with GitHub [@beller2017oops]

In Bisong et al. [@bisong2017built] the main limitation was the performance of the machine learning
algorithm used. In the R implementation was used and it proved not capable of processing the
amounts of data needed. This shows that it is important to choose the right tool when analyzing
data.

In Pinto and Rebouças[@pinto2018work] it is noted that research is often done on open source
software. There are still a lot of possibilities for researching on proprietary software projects.

Tools presented in papers might require a more large-scale and long-term study to verify that the
tool presented keeps up when it is used [@santolucito2018statically].

Future research in build analytics branches in a couple of different topics. Pinto and Rebouças [@pinto2018work]
proposes to focus on getting a better understanding of the users and why they might choose to
abandon an automatic build platform.

Baltes et al. [@baltes2018no] suggest that in future research more perspectives when analyzing commit data should
be considered, for instance partitioning commits by developer. They also note the importance
of more qualitative research.

Some open research questions from recent papers are the following:

  * How do teams change their pull request review practices in response to the introduction of
  continuous integration? [@zhao2017impact]
  * How can we detect if fixing a build configuration requires changes in the remote environment? [@vassallo2018break]
  * Does breaking the build often translate to worse project quality and decreased productivity? [@beller2017oops]
  * Could already trained models on projects with more data available be used to make accurate predictions on newer
  projects with less data available? [@ni2018acona]

From the synthesis of the works discussed in this section the following research questions emerged:

 * What is the impact of the choice of Continuous Integration platform? Most of the research is done
 on users using Travis&nbsp;CI, there are many other platforms out there. Every platform has their own
 characteristics and this could impact the effectiveness for a specific kind of project.
 * How does the platform or programming language influence effectiveness or adoption of continuous
 integration systems?
 * How can machine learning methods be better applied in the field of build analytics in order to 
 generate predictions that are easier to explain and thus can be used in practice?