# Build Analytics
## Motivation
### Introduction
Ideally, when building a project from source code to executable, the process should be
fast and error-free. Unfortunately, this is not always the case and automated
build systems notify developers of compile errors, missing dependencies, broken functionality
and many other problems. This chapter aims to give an overview of the research effort in the
build analytics field and of Continuous Integration (CI) as an increasingly common development
practice in many projects.
### Continuous Integration and Version Control System
Continuous Integration is a term used in software engineering to describe the practice of merging
all developer working copies into a shared mainline several times a day. CI is generally used
together with a Version Control System (VCS), an application for revision control that manages
changes to documents, source code and other collections of information.
<!-- The following paragraph is prone to errors when auto formatting. After a ..^[..] or ..^[..],
there _must_ be a space, else it is interpreted as superscript. -->
### Build Analytics Definition
Build analytics covers research on data extracted from the build process of a project. This
includes, among others, build logs from Continuous Integration services such as Travis CI^[See
https://travis-ci.org/], Circle CI^[See https://circleci.com/], Jenkins^[See https://jenkins.io/],
AppVeyor^[See\ https://www.appveyor.com/] and TeamCity^[See https://www.jetbrains.com/teamcity/] or
surveys among developers about their usage of Continuous Integration or build systems. This
information is often paired with data from Version Control Systems such as Git.
### Research Questions
We aim to give a complete overview of the build analytics field by analyzing both the
state of the art and the state of practice. We also inspect the future research that could be done
and conclude our survey with the research questions that emerged from this
field research. To summarize the field in a structured way, we ask the
following research questions:
**RQ1**: What is the current state of the art in the field of build analytics?
In section \@ref(build-analytics-state-of-the-art) we present the topics currently being
explored in the build analytics domain, alongside the research methods, tools and datasets
acquired for the problems at hand, and we aggregate and reflect on the main research findings of
the state-of-the-art papers.
**RQ2**: What is the current state of practice in the field of build analytics?
Section \@ref(build-analytics-state-of-practice) examines scientific papers to analyze the current
trend of build analytics in the software development industry. We look at the popularity of CI in
the industry and explore the increase in the use of CI by discussing its
benefits. Furthermore, we discuss the practices used by engineers in the industry to
ensure that their code is improving and not decaying.
**RQ3**: What future research can we expect in the field of build analytics?
In section \@ref(build-analytics-future-research) we will explore where new challenges lie in the
field of build analytics. We will also show what open research items are described in the papers.
This section ends with research questions based on the open research and challenges in current
research.
## Research Protocol {#build-analytics-research-protocol}
### Search Strategy
Starting from an initial seed consisting of Bird and Zimmermann [@bird2017predicting], Beller
et al. [@beller2017oops], Rausch et al. [@rausch2017empirical], Beller et al. (TravisTorrent)
[@beller2017travistorrent], Pinto et al. [@pinto2018work], Zhao et al. [@zhao2017impact], Widder et
al. [@widder2018m] and Hilton et al. [@hilton2016usage], we followed their references to find new papers to
analyze. Moreover, we used the academic search engine _Google Scholar_ to perform a keyword-based
search for other relevant build analytics papers. The keywords used were: build analytics,
machine learning, build time, prediction, continuous integration, build failures, active learning,
build errors, mining, software repositories, open-source software.
### Selection Criteria
In order to provide a valid and current overview of the build analytics field, we selected only
relevant papers published after 2008; in other words, we did not include papers older
than 10 years. We chose 10 years as our threshold, inspired by the "ICSE-Most Influential
Paper 10 Years Later" Award. The only paper that does not conform to this rule is the cornerstone
description of CI practices written by Martin Fowler, as we considered it important for seeing
how practices evolved in the build analytics field. Most of the papers we found are linked to our
research questions as references in the sections below. From the selected papers, we omitted two
papers, as they are small case studies on a couple of projects and do not introduce new techniques or
applications.
See table \@ref(tab:build-analytics-selected-papers) for an overview of the papers which were
selected for this survey.
## Answers
### Build Analytics State of the Art
**RQ1**: What is the current state of the art in the field of build analytics?
The current state-of-the-art in the build analytics domain refers to the use of machine learning techniques
to increase the productivity when using Continuous Integration (CI), to generate constraints on the
configuration of the CI that could improve build success rate and to predict build failures even for
newer projects with less training data available.
The papers identified using the research protocol defined in section
\@ref(build-analytics-research-protocol) that give us an overview of the current state of the art
in the build analytics field are:
* HireBuild: an automatic approach to history-driven repair of build scripts [@hassan2018hirebuild]
* A tale of CI build failures: An open source and a financial organization perspective [@vassallo2017tale]
* (No) Influence of Continuous Integration on the Commit Activity in GitHub Projects [@baltes2018no]
* Built to last or built too fast?: evaluating prediction models for build times [@bisong2017built]
* Statically Verifying Continuous Integration Configurations [@santolucito2018statically]
* ACONA: active online model adaptation for predicting continuous integration build failures [@ni2018acona]
The topics that are being explored are:
* the importance of the build process in a VCS project in reference @hassan2018hirebuild
* the factors that impact user satisfaction with CI tools in reference @widder2018m
* methods for helping the developer to fix bugs in references @hassan2018hirebuild, @vassallo2018break
* predicting build time in reference @bisong2017built
* predicting build failures in references @santolucito2018statically, @ni2018acona
The tools that are being proposed are:
* BART to help developers fix build errors by generating a summary of the failures
with useful information, thus eliminating the need to browse error logs [@vassallo2018break]
* HireBuild to automatically fix build failures based on previous changes [@hassan2018hirebuild]
* VeriCI, capable of checking for errors in CI configuration files before the developer
  pushes a commit and without needing to wait for the build result [@santolucito2018statically]
* ACONA capable of predicting build failure in CI environment for newer projects with less
data available [@ni2018acona]
#### Importance of the Build Process and CI Users Satisfaction
The build process is an important part of a project that uses VCS: findings
by Hassan et al. [@hassan2018hirebuild] suggest that for such projects 22% of code commits include changes
in build script files for building purposes. Moreover, recent studies have focused on how satisfied the users of
CI tools are. One paper by Widder et al. [@widder2018m] analyzed which factors have an impact
on abandonment of Travis CI. This paper finds that increased build complexity reduces the chance of
abandonment, that larger projects abandon it at a higher rate, and that a project's language has a significant
but varying effect. A surprising result is that metrics of configuration attempts and knowledge dispersion
in the project do not affect its rate of abandonment.
#### Patent for Predicting Build Errors
In [@bird2017predicting], Bird and Zimmermann introduce a method for predicting software build errors.
This US patent is owned by Microsoft. With logistic regression as the machine learning technique,
the patent describes how to compute the probability that a build fails. Using this method, build errors can be
better anticipated, which decreases the time between working builds and speeds up development.
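
To make the idea concrete, the following is a minimal sketch of how a logistic regression model turns features of a change into a failure probability. The feature names and coefficient values are hypothetical and chosen purely for illustration; they are not taken from the patent.

```python
import math

# Hypothetical coefficients; a real model would learn these from historical builds.
INTERCEPT = -2.0
WEIGHTS = {
    "files_changed": 0.05,
    "lines_churned": 0.002,
    "touches_build_script": 1.2,
}


def failure_probability(features):
    """Return P(build fails) under a logistic regression model."""
    # Linear combination of the features, passed through the logistic function.
    score = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Two made-up changes: a small edit and a large one that touches the build script.
small_change = {"files_changed": 1, "lines_churned": 10, "touches_build_script": 0}
risky_change = {"files_changed": 30, "lines_churned": 800, "touches_build_script": 1}

print(f"small change: {failure_probability(small_change):.2f}")
print(f"risky change: {failure_probability(risky_change):.2f}")
```

With positive weights, a larger change yields a higher predicted failure probability. A team could compare this probability against a threshold to decide, for example, whether to run extra checks before merging.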
#### Predicting Build Time
Another important aspect is the impact of CI on the efficiency of the development process. One of the
papers that addresses this matter is written by Bisong et al. [@bisong2017built]. This paper
aims to find a balance between the frequency of integration and developer productivity by proposing
machine learning models that predict the build time. For this, they took advantage of the 56 features present
in the TravisTorrent build records. Their models performed quite well, with an R-Squared of around 80%,
meaning that they were able to capture most of the variation of build time over multiple projects. Their research
could be useful, on the one hand, to software developers and project managers for better time management
and, on the other hand, to other researchers who may improve the proposed models.
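
As a reminder of what the reported metric measures, here is a small sketch of the standard R-Squared (coefficient of determination) computation. The build durations below are made up for illustration and do not come from TravisTorrent or from the paper.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical build durations in seconds.
actual_times = [120.0, 300.0, 95.0, 610.0, 240.0]
predicted_times = [150.0, 280.0, 130.0, 540.0, 260.0]

print(f"R-squared: {r_squared(actual_times, predicted_times):.2f}")
```

A value of 1 means the predictions match the observed build times exactly, while a model that always predicts the mean build time scores 0.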
#### Predicting Build Failures
Moreover, the use of build automation tools introduces a delay in the development cycle caused by
the waiting time until the build finishes. One of the most recent papers analyzed, by
Santolucito et al. [@santolucito2018statically], presents VeriCI, a tool capable of checking for errors
in CI configuration files before the developer pushes a commit and without needing to wait for the
build result. This paper focuses on predicting build failures without using metadata such as the number of
commits or code churn in the learning process, relying instead on the actual user programs and configuration
scripts. This makes it possible to identify the cause of an error. VeriCI achieves 83% accuracy in
predicting build failures on real data from GitHub projects, and 30-48% of the time the error justification provided
by the tool matched the actual error cause. These results seem promising, but more focus is needed
on producing the error justification, which could make the use of machine learning in real
build analytics tools achievable and accepted.
#### Prediction with Less Data Available
Even if there were considerable efforts in developing powerful and accurate machine learning models
for predicting the outcome of builds, most of these techniques cannot be trained properly without
extensive historical project data. The resulting problem is that newer projects are unable to take
advantage of the research conducted before and have to wait until enough data from their project
is generated to sufficiently train machine learning models for predicting the build outcome.
In reference @ni2018acona, the most recent paper of this survey, published as a poster
in June 2018, Ni et al. address the problem of build failure prediction in a CI environment for newer projects
with less available data. They use already trained models from other projects with more data available and
combine them by means of active learning to find which of these models generalizes better
to the problem at hand and to update the models' weights accordingly. They also aim to cut the expense
that CI introduces by reducing the labeled data necessary for training. Even if the method seems promising,
the results presented in the poster show an F-Measure (harmonic mean of recall and precision) of around
40%, which one might argue should be higher to be truly useful in practice.
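
To illustrate what an F-Measure of around 40% can look like, the sketch below computes precision, recall and their harmonic mean from a confusion count. The counts are hypothetical and were picked only so that the result lands at 40%; they are not taken from the poster.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for the 'build fails' class."""
    if true_positives == 0:
        # No failure was predicted correctly, so the score is zero by convention.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictor: 20 failures caught, 40 false alarms, 20 failures missed.
score = f_measure(true_positives=20, false_positives=40, false_negatives=20)
print(f"F-measure: {score:.2f}")
```

In this made-up scenario only one in three predicted failures is real and half of the real failures are missed, which gives a sense of why such a score may be too low for daily use.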
### Build Analytics State of Practice
**RQ2**: What is the current state of practice in the field of build analytics?
Continuous Integration is a software engineering practice that requires developers to integrate code into a
shared repository several times a day. Each check-in is then verified by an automated build which allows
engineers to detect bugs early.
An overview of Continuous Integration evolution from the introduction of the term to the current practices
can be seen in the figure below:
![CI overview.](figures/build-analytics/state_pr.png)
The papers identified using the research protocol defined in section
\@ref(build-analytics-research-protocol) that give us an overview of the current state of practice
in the build analytics domain are:
* Usage, Costs, and Benefits of Continuous Integration in Open-Source Projects [@hilton2016usage]
* An Empirical Analysis of Build Failures in the Continuous Integration Workflows of Java-Based Open-Source Software [@rausch2017empirical]
* Continuous Integration [@fowler2006continuous]
* Enabling Agile Testing Through Continuous Integration [@stolberg2009enabling]
* TravisTorrent: Synthesizing Travis CI and Github for Full-Stack Research on Continuous Integration [@beller2017travistorrent]
* I'm Leaving You, Travis: A Continuous Integration Breakup Story [@widder2018m]
* Continuous integration in a social-coding world: Empirical evidence from GITHUB [@vasilescu2014continuous]
The topics that are being explored are:
* Usage of CI in the industry by @hilton2016usage
* Growing popularity of CI due to the introduction of VCS as suggested by @rausch2017empirical
* Common practices used in the industry exemplified by @fowler2006continuous
* Use of common CI practice in the agile approach presented by @stolberg2009enabling
* Likelihood of pull requests versus direct commits resulting in successful builds, as
uncovered by @vasilescu2014continuous
#### Build Analytics Usage
A survey conducted in open-source projects by Hilton et al. [@hilton2016usage] indicated that 40% of all
projects used CI. It observed that the average project introduces CI a year into development. Furthermore,
the paper claims that CI is widely used in practice nowadays. One of many factors contributing to this
is explored by Rausch et al. [@rausch2017empirical]. The growing popularity of Version Control Systems (VCS)
such as Git, and of hosted build automation platforms such as Travis, has enabled businesses of any size to
adopt the CI framework. As suggested by Hilton et al. [@hilton2016usage], the cost and time associated with
introducing the CI framework are not enormous and the benefits far outweigh the resources required.
#### Build Analytics Practices
The CI concept, often attributed to Martin Fowler [@fowler2006continuous], is recommended as a best practice
in agile software development methods such as Extreme Programming [@stolberg2009enabling]. Fowler introduced many
practices that are essential in maintaining the CI framework. Fowler and Foemmel [@fowler2006continuous] urge
engineers to keep all artifacts required to build the project in a single repository. This ensures that the
system does not require additional dependencies. In addition, they advise creating a build script that can compile
the code, execute unit tests and automate integration. Once the code is built, all tests should run to confirm
that the built artifact behaves as the developer expects. In this way, software
bugs are found and eradicated earlier, while builds are kept fast. As explored by Widder et al. [@widder2018m], one of the factors that lead
to companies abandoning the CI framework is the complexity of the build. A good practice is to have more
fast-executing tests than slow tests.
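
The build script practice described above can be sketched as a small driver that runs the build steps in order and stops at the first failure. This is only an illustration under stated assumptions: the step names and commands are placeholders (each one simply runs a trivial Python snippet), and a real project would call its own compiler and test runner instead.

```python
import subprocess
import sys


def run_build(steps):
    """Run build steps in order and stop at the first failing step.

    Returns (True, None) on success or (False, step_name) on failure.
    """
    for name, command in steps:
        print(f"[build] {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fail fast so that the developer gets feedback as early as possible.
            return False, name
    return True, None


# Placeholder commands standing in for real compile and test invocations.
PLACEHOLDER_STEPS = [
    ("compile", [sys.executable, "-c", "print('compiling')"]),
    ("unit tests", [sys.executable, "-c", "print('running unit tests')"]),
    ("integration", [sys.executable, "-c", "print('integrating')"]),
]

ok, failed_step = run_build(PLACEHOLDER_STEPS)
print("build passed" if ok else f"build failed at: {failed_step}")
```

Ordering the fast steps first means that a broken build is usually reported before the slow steps start.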
Furthermore, builds should be readily available to stakeholders and testers as this can reduce the amount of rework
required when rebuilding a feature that does not meet the requirements. In general, all companies should (at least) schedule a
"nightly build" to update the project from the repository to ensure everyone is up to date. Continuous Integration
is all about communication, so it is important to ensure that everyone can easily see the current state of the system.
This is another reason why CI works well with agile methods [@stolberg2009enabling]:
both stress the importance of good communication.
The paper by Vasilescu et al. [@vasilescu2014continuous] studies a sample of large and active
GitHub projects developed in Java, Python and Ruby. The paper finds that direct code modifications
(commits) are more popular than indirect code modifications (pull requests). Additionally,
automated testing is not widely practiced. Most projects in the sample of Vasilescu et al.
[@vasilescu2014continuous] were configured to use Travis CI; however, fewer than half actually used it.
In terms of languages, Ruby projects are among the early adopters of Travis CI, while Java
projects are late to adopt CI. The paper uncovers that pull requests are much more likely to
result in successful builds than direct commits.
### Build Analytics Future Research
**RQ3**: What future research can we expect in the field of build analytics?
Currently, research on build analytics is limited by some challenges: some are specific to
build analytics and some apply to the entire field of software engineering.
The papers identified using the research protocol defined in section
\@ref(build-analytics-research-protocol) that give us an overview of challenges and future research
in the field of build analytics are:
* Built to last or built too fast?: evaluating prediction models for build times [@bisong2017built]
* Work Practices and Challenges in Continuous Integration: A Survey with Travis CI Users [@pinto2018work]
* Statically Verifying Continuous Integration Configurations [@santolucito2018statically]
* (No) Influence of Continuous Integration on the Commit Activity in GitHub Projects [@baltes2018no]
* The impact of continuous integration on other software development practices: a large-scale empirical study [@zhao2017impact]
* Un-Break My Build: Assisting Developers with Build Repair Hints [@vassallo2018break]
* Oops, my tests broke the build: An explorative analysis of Travis CI with GitHub [@beller2017oops]
In Bisong et al. [@bisong2017built], the main limitation was the performance of the machine learning
algorithm used. An R implementation was used, and it proved incapable of processing the
amounts of data needed. This shows that it is important to choose the right tool when analyzing
data.
In Pinto and Rebouças [@pinto2018work] it is noted that research is often done on open source
software. There are still many possibilities for research on proprietary software projects.
Tools presented in papers might require a larger-scale and longer-term study to verify that the
tool presented keeps up when it is used [@santolucito2018statically].
Future research in build analytics branches into a couple of different topics. Pinto and Rebouças [@pinto2018work]
propose to focus on getting a better understanding of the users and why they might choose to
abandon an automatic build platform.
Baltes et al. [@baltes2018no] suggest that future research should consider more perspectives when
analyzing commit data, for instance partitioning commits by developer. They also note the importance
of more qualitative research.
Some open research questions from recent papers are the following:
* How do teams change their pull request review practices in response to the introduction of
continuous integration? [@zhao2017impact]
* How can we detect if fixing a build configuration requires changes in the remote environment? [@vassallo2018break]
* Does breaking the build often translate to worse project quality and decreased productivity? [@beller2017oops]
* Could already trained models on projects with more data available be used to make accurate predictions on newer
projects with less data available? [@ni2018acona]
From the synthesis of the works discussed in this section the following research questions emerged:
* What is the impact of the choice of Continuous Integration platform? Most of the research is done
on users of Travis CI, but there are many other platforms. Every platform has its own
characteristics, and this could impact its effectiveness for a specific kind of project.
* How does the platform or programming language influence effectiveness or adoption of continuous
integration systems?
* How can machine learning methods be better applied in the field of build analytics in order to
generate predictions that are easier to explain and thus can be used in practice?