-
Notifications
You must be signed in to change notification settings - Fork 10
/
README.Rmd
executable file
·358 lines (302 loc) · 15.7 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
output:
github_document:
toc: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "inst/image/"
)
knitr::opts_chunk$set(fig.width=10, fig.height=6, fig.path="inst/image/")
```
[![Build Status](https://travis-ci.org/BenjaK/scanstatistics.svg?branch=master)](https://travis-ci.org/BenjaK/scanstatistics)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/scanstatistics)](https://cran.r-project.org/package=scanstatistics)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/grand-total/scanstatistics)](http://cran.rstudio.com/web/packages/scanstatistics/index.html)
[![DOI](http://joss.theoj.org/papers/10.21105/joss.00515/status.svg)](https://doi.org/10.21105/joss.00515)
# scanstatistics
An R package for space-time anomaly detection using scan statistics.
## Installing the package
To install the latest (CRAN) release of this package, type the following:
```{r install_cran, eval=FALSE}
install.packages("scanstatistics")
```
To install the development version of this package, type this instead:
```{r install_github, eval=FALSE}
devtools::install_github("benjak/scanstatistics", ref = "develop")
```
## What are scan statistics?
Scan statistics are used to detect anomalous clusters in spatial or space-time
data. The gist of the methodology, at least in this package, is this:
1. Monitor one or more data streams at multiple _locations_ over intervals of
time.
2. Form a set of space-time _clusters_, each consisting of (1) a collection of
locations, and (2) an interval of time stretching from the present to some
number of time periods in the past.
3. For each cluster, compute a statistic based on both the observed and the
expected responses. Report the clusters with the largest statistics.
## Main functions
### Scan statistics
- __`scan_eb_poisson`__: computes the expectation-based Poisson scan
statistic (Neill 2005).
- __`scan_pb_poisson`__: computes the (population-based) space-time scan
statistic (Kulldorff 2001).
- __`scan_eb_negbin`__: computes the expectation-based negative binomial scan
statistic (Tango et al. 2011).
- __`scan_eb_zip`__: computes the expectation-based zero-inflated Poisson
scan statistic (Allévius & Höhle 2017).
- __`scan_permutation`__: computes the space-time permutation scan statistic
(Kulldorff et al. 2005).
- __`scan_bayes_negbin`__: computes the Bayesian Spatial scan statistic (Neill
2006), extended to a space-time setting.
### Zone creation
- __`knn_zones`__: Creates a set of spatial _zones_ (groups of locations) to
scan for anomalies. Input is a matrix in which rows are the
enumerated locations, and columns the k nearest neighbors.
To create such a matrix, the following two functions are
useful:
+ __`coords_to_knn`__: use `stats::dist` to get the k nearest neighbors
of each location into a format usable by `knn_zones`.
+ __`dist_to_knn`__: use an already computed distance matrix to get the
k nearest neighbors of each location into a format
usable by `knn_zones`.
- __`flexible_zones`__: An alternative to `knn_zones` that uses the adjacency
structure of locations to create a richer set of zones.
The additional input is an adjacency matrix, but
otherwise works as `knn_zones`.
### Miscellaneous
- __`score_locations`__: Score each location by how likely it is to have an
ongoing anomaly in it. This score is heuristically
motivated.
- __`top_clusters`__: Get the top k space-time clusters, either overlapping or
non-overlapping in the spatial dimension.
- __`df_to_matrix`__: Convert a data frame with data for each location and time
point to a matrix with locations along the column
dimension and time along the row dimension, with the
selected data as values.
## Example: Brain cancer in New Mexico
To demonstrate the scan statistics in this package, we will use a dataset of the
annual number of brain cancer cases in the counties of New Mexico, for the years
1973-1991. This data was studied by Kulldorff (1998), who detected a cluster of
cancer cases in the counties Los Alamos and Santa Fe during the years
1986-1989, though the excess of brain cancer in this cluster was not deemed
statistically significant. The data originally comes from the package
_rsatscan_, which provides an interface to the program
[SaTScan](http://www.satscan.org), but it has been aggregated and extended for
the _scanstatistics_ package.
To get familiar with the counties of New Mexico, we begin by plotting them on a
map using the data frames `NM_map` and `NM_geo` supplied by the _scanstatistics_
package:
```{r newmexico_map, eval=TRUE}
library(scanstatistics)
library(ggplot2)
# Load map data
data(NM_map)
data(NM_geo)
# Plot map with labels at centroids
ggplot() +
geom_polygon(data = NM_map,
mapping = aes(x = long, y = lat, group = group),
color = "grey", fill = "white") +
geom_text(data = NM_geo,
mapping = aes(x = center_long, y = center_lat, label = county)) +
ggtitle("Counties of New Mexico")
```
We can further obtain the yearly number of cases and the population for each
country for the years 1973-1991 from the data table `NM_popcas` provided
by the package:
```{r load_count_data}
data(NM_popcas)
head(NM_popcas)
```
It should be noted that Cibola county was split from Valencia county in 1981,
and cases in Cibola have been counted to Valencia in the data.
### A scan statistic for Poisson data
The Poisson distribution is a natural first option when dealing with count data.
The _scanstatistics_ package provides the two functions `scan_eb_poisson` and
`scan_pb_poisson` with this distributional assumption. The first is an
expectation-based[^1] scan statistic for univariate Poisson-distributed data
proposed by Neill et al. (2005), and we focus on this one in the example below.
The second scan statistic is the population-based scan statistic proposed by
Kulldorff (2001).
[^1]: Expectation-based scan statistics use past non-anomalous data to estimate
distribution parameters, and then compares observed cluster counts from the time
period of interest to these estimates. In contrast, _population-based_ scan
statistics compare counts in a cluster to those outside, only using data from
the period of interest, and does so conditional on the observed total count.
#### Using the Poisson scan statistic
The first argument to any of the scan statistics in this package should be a
matrix (or array) of observed counts, whether they be integer counts or
real-valued "counts". In such a matrix, the columns should represent locations
and the rows the time intervals, ordered chronologically from the earliest
interval in the first row to the most recent in the last. In this example we
would like to detect a potential cluster of
brain cancer in the counties of New Mexico during the years 1986-1989, so we
begin by retrieving the count and population data from that period and
reshaping them to a matrix using the helper function `df_to_matrix`:
```{r get_counts}
library(dplyr)
counts <- NM_popcas %>%
filter(year >= 1986 & year < 1990) %>%
df_to_matrix(time_col = "year", location_col = "county", value_col = "count")
```
#### Spatial zones
The second argument to `scan_eb_poisson` should be a list of integer vectors,
each such vector being a _zone_, which is the name for the spatial component of
a potential outbreak cluster. Such a zone consists of one or more locations
grouped together according to their similarity across features, and each
location is numbered as the corresponding column index of the `counts` matrix
above (indexing starts at 1).
In this example, the locations are the counties of New Mexico and the
features are the coordinates of the county seats. These are made available in
the data table `NM_geo`. Similarity will be measured using the geographical
distance between the seats of the counties, taking into account the curvature
of the earth. A distance matrix is calculated using the `spDists` function from
the _sp_ package, which is then passed to `dist_to_knn`
and on to `knn_zones`:
```{r get_zones, eval=TRUE}
library(sp)
library(magrittr)
# Remove Cibola since cases have been counted towards Valencia. Ideally, this
# should be accounted for when creating the zones.
zones <- NM_geo %>%
filter(county != "cibola") %>%
select(seat_long, seat_lat) %>%
as.matrix %>%
spDists(x = ., y = ., longlat = TRUE) %>%
dist_to_knn(k = 15) %>%
knn_zones
```
#### Baselines
The advantage of expectation-based scan statistics is that parameters such as
the expected value can be modelled and estimated from past data e.g. by some
form of regression. For the expectation-based Poisson scan statistic, we can use
a (very simple) Poisson GLM to estimate the expected value of the count in each
county and year, accounting for the different populations in each region.
Similar to the `counts` argument, the expected values should be passed as a
matrix to the `scan_eb_poisson` function:
```{r fit_baselines, eval=TRUE}
mod <- glm(count ~ offset(log(population)) + 1 + I(year - 1985),
family = poisson(link = "log"),
data = NM_popcas %>% filter(year < 1986))
ebp_baselines <- NM_popcas %>%
filter(year >= 1986 & year < 1990) %>%
mutate(mu = predict(mod, newdata = ., type = "response")) %>%
df_to_matrix(value_col = "mu")
```
Note that the population numbers are
(perhaps poorly) interpolated from the censuses conducted in 1973, 1982, and
1991.
#### Calculation
We can now calculate the Poisson scan statistic. To give us more confidence
in our detection results, we will perform 999 Monte Carlo replications, by which
data is generated using the parameters from the null hypothesis and a new scan
statistic calculated. This is then summarized in a P-value, calculated as
the proportion of times the replicated scan statistics exceeded the observed
one. The output of `scan_poisson` is an object of class "scanstatistic", which
comes with the print method seen below.
```{r run_ebp, eval=TRUE}
set.seed(1)
poisson_result <- scan_eb_poisson(counts = counts,
zones = zones,
baselines = ebp_baselines,
n_mcsim = 999)
print(poisson_result)
```
As we can see, the most likely cluster for an anomaly stretches from 1986-1989
and involves the locations numbered 15 and 26, which correspond to the counties
```{r show_MLC, eval=TRUE, comment=NA}
counties <- as.character(NM_geo$county)
counties[c(15, 26)]
```
These are the same counties detected by Kulldorff (1998), though their analysis
was retrospective rather than prospective as ours was. Ours was also data
dredging as we used the same study period with hopes of detecting
the same cluster.
#### A heuristic score for locations
We can score each county according to how likely it is to be part of a cluster
in a heuristic fashion using the function `score_locations`, and visualize the
results on a heatmap as follows:
```{r county_scores, eval=TRUE}
# Calculate scores and add column with county names
county_scores <- score_locations(poisson_result, zones)
county_scores %<>% mutate(county = factor(counties[-length(counties)],
levels = levels(NM_geo$county)))
# Create a table for plotting
score_map_df <- merge(NM_map, county_scores, by = "county", all.x = TRUE) %>%
arrange(group, order)
# As noted before, Cibola county counts have been attributed to Valencia county
score_map_df[score_map_df$subregion == "cibola", ] %<>%
mutate(relative_score = score_map_df %>%
filter(subregion == "valencia") %>%
select(relative_score) %>%
.[[1]] %>% .[1])
ggplot() +
geom_polygon(data = score_map_df,
mapping = aes(x = long, y = lat, group = group,
fill = relative_score),
color = "grey") +
scale_fill_gradient(low = "#e5f5f9", high = "darkgreen",
guide = guide_colorbar(title = "Relative\nScore")) +
geom_text(data = NM_geo,
mapping = aes(x = center_long, y = center_lat, label = county),
alpha = 0.5) +
ggtitle("County scores")
```
A warning though: the `score_locations` function can be quite slow for large
data sets. This might change in future versions of the package.
#### Finding the top-scoring clusters
Finally, if we want to know not just the most likely cluster, but say the five
top-scoring space-time clusters, we can use the function `top_clusters`.
The clusters returned can either be overlapping or non-overlapping in the
spatial dimension, according to our liking.
```{r top_counties, eval=TRUE}
top5 <- top_clusters(poisson_result, zones, k = 5, overlapping = FALSE)
# Find the counties corresponding to the spatial zones of the 5 clusters.
top5_counties <- top5$zone %>%
purrr::map(get_zone, zones = zones) %>%
purrr::map(function(x) counties[x])
# Add the counties corresponding to the zones as a column
top5 %<>% mutate(counties = top5_counties)
```
The `top_clusters` function includes Monte Carlo and Gumbel P-values for each
cluster. These P-values are conservative, since secondary
clusters from the original data are compared to the most likely clusters from
the replicate data sets.
## Concluding remarks
Other univariate scan statistics can be calculated practically in the same way
as above, though the distribution parameters need to be adapted for each scan
statistic.
# Feedback
If you think this package lacks some functionality, or that something needs
better documentation, I happily accept feedback either here at GitHub or via
email at benjak@math.su.se. I'm also very interested in applying the methods in
this package (current and future) to new problems, so if you know of any
suitable public datasets, please tell me! A dataset with a multivariate
response (e.g. multiple counter variables) would be of particular interest for
some of the scan statistics that will appear in future versions of the package.
# References
Allévius, B., M. Höhle (2017): _An expectation-based space-time scan statistic
for ZIP-distributed data_, (under review).
Kleinman, K. (2015): _Rsatscan: Tools, Classes, and Methods for Interfacing
with SaTScan Stand-Alone Software_,
https://CRAN.R-project.org/package=rsatscan.
Kulldorff, M., Athas, W. F., Feuer, E. J., Miller, B. A., Key, C. R. (1998):
_Evaluating Cluster Alarms: A Space-Time Scan Statistic and Brain Cancer in Los
Alamos_, American Journal of Public Health 88 (9), 1377–80.
Kulldorff, M. (2001), _Prospective time periodic geographical disease
surveillance using a scan statistic_, Journal of the Royal Statistical Society,
Series A (Statistics in Society), 164, 61–72.
Kulldorff, M., Heffernan, R., Hartman, J., Assunção, R. M., Mostashari, F.
(2005): _A space-time permutation scan statistic for disease outbreak
detection_, PLoS Medicine, 2 (3), 0216-0224.
Neill, D. B., Moore, A. W., Sabhnani, M., Daniel, K. (2005): _Detection of
Emerging Space-Time Clusters_, In Proceedings of the Eleventh ACM SIGKDD
International Conference on Knowledge Discovery in Data Mining, 218–27. ACM.
Neill, D. B., Moore, A. W., Cooper, G. F. (2006): _A Bayesian Spatial Scan
Statistic_, Advances in Neural Information Processing Systems 18: Proceedings of
the 2005 Conference.
Tango, T., Takahashi, K. Kohriyama, K. (2011), _A Space-Time Scan Statistic for
Detecting Emerging Outbreaks_, Biometrics 67 (1), 106–15.