# Practical session 3
This practical session demonstrates the application of the most commonly used spatial analysis tools to spatial transcriptomics (STx) data, and how we work with coordinate data alongside expression data.
## Load packages
```{r 03_loadPackages, eval=TRUE}
library(SpatialFeatureExperiment)
library(tidyverse)
library(scran)
library(scater)
library(ggspavis)
library(sf)
library(spdep)
library(GWmodel)
library(tidyterra)
# STExplorer is our own, under-development package for STx analysis
library(STExplorerDev)
```
- [`spdep`](https://cran.r-project.org/web/packages/spdep/index.html) is a collection of functions to create spatial weights matrix objects from polygon *contiguities*, from point patterns by distance, and from tessellations. It also summarises these objects and permits their use in spatial data analyses such as regional aggregation and tests for spatial *autocorrelation*.
- [`sf`](https://cran.r-project.org/web/packages/sf/index.html) (*Simple Features for R*) is a package that offers support for simple features, a standardized way to encode spatial vector data.
- [`GWmodel`](https://cran.r-project.org/web/packages/GWmodel/index.html) is a suite of models that fit situations when data are not described well by some global model, but where there are spatial regions where a suitably localised calibration provides a better description.
## Background
### Main geocomputational data structures
There are three main data structures that we need to have ready before we undertake a geocomputational approach to STx data analysis: (1) geometries (point and polygon), (2) neighbours lists, and (3) distance matrices.
1. Spatial geometries can be points, lines, polygons or pixels. Polygons consist of a multitude of points connected by lines and can take many forms, such as circles, hexagons or non-canonical polygons.
2. Neighbours lists are special types of lists that contain information about the neighbours of each polygon. Neighbours can be defined either by adjacency or by distance.
3. Distance matrices contain the distances between each pair of points and can be either weighted or un-weighted. Weighted distances are specific to each point and its neighbours: the closer or farther a neighbour lies from the focal point, the more its distance weight changes according to an applied kernel. For STx data, such as those generated by the 10X Visium platform, the un-weighted distance between two points is usually expressed in pixels and is obtained from the `spaceranger` output.
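As a toy sketch of these three structures, using the `sf` and `spdep` packages loaded above (the five spot coordinates are invented purely for illustration):

```r
## Toy coordinates for five "spots" (invented for illustration)
coords <- cbind(x = c(0, 1, 0, 1, 2), y = c(0, 0, 1, 1, 1))

## (1) Geometries: five POINT geometries held in an sfc list-column
pts <- sf::st_as_sf(as.data.frame(coords), coords = c("x", "y"))

## (2) Neighbours list: neighbours defined by distance (here, within 1.2 units)
nb <- spdep::dnearneigh(coords, d1 = 0, d2 = 1.2)

## (3) Distance matrix: plain (un-weighted) pairwise Euclidean distances
d <- as.matrix(dist(coords))
```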
### The `sf` objects
Package `sf` represents simple features as native R objects. All functions and methods in `sf` that operate on spatial data are prefixed by *st\_*, which refers to *spatial type*. Simple features are implemented as R native data, using simple data structures (S3 classes, lists, matrix, vector). The typical use of `sf` involves reading, manipulating and writing of sets of features, with attributes and geometries.
As attributes are typically stored in `data.frame` objects (or the very similar `tbl_df`), we will also store feature geometries in a `data.frame` column. Since geometries are not single-valued, they are put in a list-column, a list of length equal to the number of records in the `data.frame`, with each list element holding the simple feature geometry of that feature. The three classes used to represent simple features are:
- `sf`, the table (`data.frame`) with feature attributes and feature geometries, which contains
- `sfc`, the list-column with the geometries for each feature (record), which is composed of
- `sfg`, the feature geometry of an individual simple feature.
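A minimal sketch of how these three classes nest, built from scratch (the `id` column and point coordinates are invented for illustration):

```r
library(sf)

pt1  <- st_point(c(0, 0))                          # sfg: geometry of a single feature
pt2  <- st_point(c(1, 1))
geom <- st_sfc(pt1, pt2)                           # sfc: the geometry list-column
tbl  <- st_sf(id = c("a", "b"), geometry = geom)   # sf: attributes + geometries

class(tbl)   # "sf" "data.frame"
```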
#### Simple feature geometry types
The following seven simple feature types are the most common:
| type | description |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `POINT` | zero-dimensional geometry containing a single point |
| `LINESTRING` | sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry |
| `POLYGON` | geometry with a positive area (two-dimensional); sequence of points form a closed, non-self intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring |
| `MULTIPOINT` | set of points; a MULTIPOINT is simple if no two Points in the MULTIPOINT are equal |
| `MULTILINESTRING` | set of linestrings |
| `MULTIPOLYGON` | set of polygons |
| `GEOMETRYCOLLECTION` | set of geometries of any type except GEOMETRYCOLLECTION |
Each of the geometry types can also be a (typed) empty set, containing zero coordinates (for `POINT`, the standard does not make clear how to represent the empty geometry). Empty geometries can be thought of as the analogue of missing (`NA`) attributes, `NULL` values or empty lists.
#### sf: objects with simple features
As we usually do not work with geometries of single `simple features`, but with datasets consisting of sets of features with attributes, the two are put together in `sf` (simple feature) objects. The following command reads a test dataset called `nc` from a file that is contained in the `sf` package:
```{r 03_sf_LoadTest}
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
```
The short report printed gives the file name, the driver (ESRI Shapefile), mentions that there are 100 features (records, represented as rows) and 14 fields (attributes, represented as columns).
This object is of class:
```{r 03_sf_TestClass}
class(nc)
```
meaning it extends (and "is" a) `data.frame`, but with a single list-column with geometries, which is held in the column with name:
```{r 03_sf_Test_sf_column}
attr(nc, "sf_column")
```
If we print the first three features, we see their attribute values and an abridged version of the geometry
```{r 03_sf_Test_print, echo=TRUE, eval=FALSE}
print(nc[9:15], n = 3)
```
which would give the following output:
```{r Sf-overview, echo=FALSE, out.width = "100%", fig.align="center", fig.cap="Overview of the `sf` object."}
knitr::include_graphics("images/sf_xfig.png")
```
In the output we see:
- in green a simple feature: a single record, or `data.frame` row, consisting of attributes and geometry
- in blue a single simple feature geometry (an object of class `sfg`)
- in red a simple feature list-column (an object of class `sfc`, which is a column in the `data.frame`)
- that although geometries are native R objects, they are printed as [well-known text](#wkb)
It is also possible to create `data.frame` objects with geometry list-columns that are not of class `sf`, e.g. by:
```{r 03_sf_no.sf}
nc.no_sf <- as.data.frame(nc)
class(nc.no_sf)
```
However, such objects:
- no longer register which column is the geometry list-column
- no longer have a plot method, and
- lack all of the other dedicated methods for class `sf`
#### sfc: simple feature geometry list-column
The column in the `sf` data.frame that contains the geometries is a list, of class `sfc`. We can retrieve the geometry list-column in this case by using standard `data.frame` notation like `nc$geom` or `nc[[15]]`, but the more general way uses `st_geometry`:
```{r 03_sfc_Test1}
(nc_geom <- st_geometry(nc))
```
Geometries are printed in abbreviated form, but we can view a complete geometry by selecting it, e.g. the first one by:
```{r 03_sfc_Test2}
nc_geom[[1]]
```
The way this is printed is called *well-known text*, and is part of the standards. The word `MULTIPOLYGON` is followed by three parentheses, because it can consist of multiple polygons, in the form of `MULTIPOLYGON(POL1,POL2)`, where `POL1` might consist of an exterior ring and zero or more interior rings, as of `(EXT1,HOLE1,HOLE2)`. Sets of coordinates are held together with parentheses, so we get `((crds_ext)(crds_hole1)(crds_hole2))` where `crds_` is a comma-separated set of coordinates of a ring. This leads to the case above, where `MULTIPOLYGON(((crds_ext)))` refers to the exterior ring (1), without holes (2), of the first polygon (3) - hence three parentheses.
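We can check this parenthesis nesting directly with `st_as_text()`, which returns the well-known text for a geometry; the square below is a toy example, not part of the `nc` data:

```r
library(sf)

## A closed exterior ring (first and last coordinate pairs are equal)
ring <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
mp <- st_multipolygon(list(list(ring)))   # one polygon, no holes

st_as_text(mp)
# MULTIPOLYGON (((0 0, 1 0, ...)))  -- three opening parentheses, as described above
```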
We can see there is a single polygon with no rings:
```{r 03_sfc_Test3, fig.height=3}
par(mar = c(0,0,1,0))
plot(nc[1], reset = FALSE) # reset = FALSE: we want to add to a plot with a legend
plot(nc[1,1], col = 'grey', add = TRUE)
```
Following the `MULTIPOLYGON` data structure, in R we have a list of lists of lists of matrices. For instance, we get the first 3 coordinate pairs of the second exterior ring (first ring is always exterior) for the geometry of feature 4 by:
```{r 03_sfc_Test4}
nc_geom[[4]][[2]][[1]][1:3,]
```
Geometry columns have their own class,
```{r 03_sfc_Test5}
class(nc_geom)
```
#### sfg: simple feature geometry
Simple feature geometry (`sfg`) objects carry the geometry for a single feature, e.g. a point, linestring or polygon.
Simple feature geometries are implemented as R native data, using the following rules
1. a single POINT is a numeric vector
2. a set of points, e.g. in a LINESTRING or ring of a POLYGON is a `matrix`, each row containing a point
3. any other set is a `list`
```{r 03_sf_Test6, echo=FALSE, eval=TRUE, message=FALSE}
p <- rbind(c(3.2,4), c(3,4.6), c(3.8,4.4), c(3.5,3.8), c(3.4,3.6), c(3.9,4.5))
(mp <- st_multipoint(p))
s1 <- rbind(c(0,3),c(0,4),c(1,5),c(2,5))
(ls <- st_linestring(s1))
s2 <- rbind(c(0.2,3), c(0.2,4), c(1,4.8), c(2,4.8))
s3 <- rbind(c(0,4.4), c(0.6,5))
(mls <- st_multilinestring(list(s1,s2,s3)))
p1 <- rbind(c(0,0), c(1,0), c(3,2), c(2,4), c(1,4), c(0,0))
p2 <- rbind(c(1,1), c(1,2), c(2,2), c(1,1))
pol <-st_polygon(list(p1,p2))
p3 <- rbind(c(3,0), c(4,0), c(4,1), c(3,1), c(3,0))
p4 <- rbind(c(3.3,0.3), c(3.8,0.3), c(3.8,0.8), c(3.3,0.8), c(3.3,0.3))[5:1,]
p5 <- rbind(c(3,3), c(4,2), c(4,3), c(3,3))
(mpol <- st_multipolygon(list(list(p1,p2), list(p3,p4), list(p5))))
(gc <- st_geometrycollection(list(mp, mpol, ls)))
```
The figure below illustrates the different types of geometries:
```{r 03_sf_Test7, echo=FALSE, eval=TRUE}
par(mar = c(0.1, 0.1, 1.3, 0.1), mfrow = c(2, 3))
plot(mp, col = 'red')
box()
title("MULTIPOINT")
plot(ls, col = 'red')
box()
title("LINESTRING")
plot(mls, col = 'red')
box()
title("MULTILINESTRING")
plot(pol, border = 'red', col = 'grey', xlim = c(0,4))
box()
title("POLYGON")
plot(mpol, border = 'red', col = 'grey')
box()
title("MULTIPOLYGON")
plot(gc, border = 'grey', col = 'grey')
box()
title("GEOMETRYCOLLECTION")
par(mfrow = c(1, 1))
```
Geometries can also be empty, as in
```{r 03_sf_Test8, collapse=TRUE}
(x <- st_geometrycollection())
length(x)
```
*The above material is taken from the well-written and thorough `sf` package [vignette](https://cran.r-project.org/web/packages/sf/vignettes/sf1.html).*
## Data structures preparation
For this practical we will be using a human steatotic liver dataset from the [Liver Atlas](https://livercellatlas.org/index.php) [@GUILLIAMS2022379]. Specifically, we will use the JBO019 sample.
### Load new dataset
**Note** - Between now and section 3.8 (*Neighbour graph and distance matrix*) nothing new is introduced: this repeats the QC carried out in practical 2, but on the new liver dataset. Feel free to skip over this section and use the code block in *Code for 3.3 to 3.7*, just before section 3.8, to run this QC quickly for these data.
First we generate the `SpatialFeatureExperiment` (SFE) object, an extension of the `SpatialExperiment` (SPE) object that we used in the 2nd practical session. The difference is that the SFE object incorporates the `sf` object structure and can therefore accommodate the use of geocomputational tools.
```{r 03_load_sfe, warning=FALSE, message=FALSE}
sampleDir <- "./data/spaceranger_outs/Human_Liver_Steatotic/JBO019_Results"
sampleNames <- "JBO019"
sfe <- read10xVisiumSFE(samples = sampleDir,
sample_id = sampleNames,
type = "sparse",
data = "filtered",
images = "lowres",
style = "W",
zero.policy = TRUE)
ground_truth <- read_table("./data/to_load/spotzonationGroup.txt")
```
## Spot-level Quality Control
### Calculating QC metrics
In this section we effectively recapitulate the spot- and gene-level QC from practical 2 on this new dataset, so that we can use it for the practical exercises in the next session.
```{r 03_QC_sfe1, message=FALSE, warning=FALSE}
is_mito <- grepl("(^MT-)|(^mt-)", rowData(sfe)$symbol)
sfe <- addPerLocQC(sfe, gTruth = ground_truth, assay = "counts", 2, subsets = list(mito = is_mito))
sfe <- addGeometries(sfe, samples = sampleDir, sample_id = sampleNames, res = "fullres")
sfe <- addPerGeneQC(sfe, assay = "counts", version = NULL, mirror = NULL)
colData(sfe)
rowData(sfe)
colGeometries(sfe)
```
### Plot manual annotation
```{r 03_QC_sfe2, message=FALSE, warning=FALSE}
ggplot() +
geom_sf(aes(geometry = colGeometries(sfe)$spotHex$geometry, fill = colData(sfe)$annotation)) +
theme_void() +
theme(legend.position = "right") +
labs(fill = "Annotation")
```
### Library size threshold
```{r 03_QC_sfe3}
# ----------------------------------------------- #
## Density and histogram of library sizes
ggplot(data = as.data.frame(colData(sfe)),
aes(x = sum)) +
geom_histogram(aes(y = after_stat(density)),
colour = "black",
fill = "grey",
bins = 50) +
geom_density(alpha = 0.5,
adjust = 0.5,
fill = "#A0CBE8",
colour = "#4E79A7") +
geom_vline(xintercept = c(1000, NA),
colour = "red",
linetype = "dashed") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
xlab("Library size") +
ylab("Density") +
theme_classic()
## Select library size threshold
qc_lib_size <- colData(sfe)$sum < 1000 #| colData(sfe)$sum > 45000
## Check how many spots are filtered out
table(qc_lib_size)
## Add threshold in colData
colData(sfe)$qc_lib_size <- qc_lib_size
## Check putative spatial patterns of removed spots
ggplot() +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry)) +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry, fill = colData(sfe)$qc_lib_size)) +
scale_fill_manual(values = c("grey95", "red")) +
labs(fill = "Discarded") +
theme_bw()
```
### Number of expressed genes
```{r 03_QC_sfe4}
# ----------------------------------------------- #
## Density and histogram of expressed genes
ggplot(data = as.data.frame(colData(sfe)),
aes(x = detected)) +
geom_histogram(aes(y = after_stat(density)),
colour = "black",
fill = "grey",
bins = 50) +
geom_density(alpha = 0.5,
adjust = 0.5,
fill = "#A0CBE8",
colour = "#4E79A7") +
geom_vline(xintercept = c(550, NA),
colour = "red",
linetype = "dashed") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
xlab("Genes expressed in each spot") +
ylab("Density") +
theme_classic()
## Select expressed genes threshold
qc_detected <- colData(sfe)$detected < 550 #| colData(sfe)$detected > 6000
## Check how many spots are filtered out
table(qc_detected)
## Add threshold in colData
colData(sfe)$qc_detected <- qc_detected
## Check for putative spatial pattern of removed spots
ggplot() +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry)) +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry, fill = colData(sfe)$qc_detected)) +
scale_fill_manual(values = c("grey95", "red")) +
labs(fill = "Discarded") +
theme_bw()
```
### Percentage of mitochondrial expression
```{r 03_QC_sfe5}
# ----------------------------------------------- #
## Density and histogram of percentage of mitochondrial expression
ggplot(data = as.data.frame(colData(sfe)),
aes(x = subsets_mito_percent)) +
geom_histogram(aes(y = after_stat(density)),
colour = "black",
fill = "grey",
bins = 50) +
geom_density(alpha = 0.5,
adjust = 0.5,
fill = "#A0CBE8",
colour = "#4E79A7") +
geom_vline(xintercept = c(22, NA),
colour = "red",
linetype = "dashed") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
xlab("Percentage of mitochondrial expression") +
ylab("Density") +
theme_classic()
## Select mitochondrial percentage threshold
qc_mito <- colData(sfe)$subsets_mito_percent > 22
## Check how many spots are filtered out
table(qc_mito)
## Add threshold in colData
colData(sfe)$qc_mito <- qc_mito
## Check for putative spatial pattern of removed spots
ggplot() +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry)) +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry, fill = colData(sfe)$qc_mito)) +
scale_fill_manual(values = c("grey95", "red")) +
labs(fill = "Discarded") +
theme_bw()
```
### Remove low-quality spots
```{r 03_QC_sfe6}
# ----------------------------------------------- #
## Check the number of discarded spots for each metric
apply(cbind(qc_lib_size, qc_detected, qc_mito), 2, sum)
## Combine together the set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito
table(discard)
## Store the set in the object
colData(sfe)$discard <- discard
## Check for putative spatial pattern of removed spots
ggplot() +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry)) +
geom_sf(data = colGeometry(sfe, "spotHex"),
aes(geometry = geometry, fill = colData(sfe)$discard)) +
scale_fill_manual(values = c("grey95", "red")) +
labs(fill = "Discarded") +
theme_bw()
# ----------------------------------------------- #
## remove combined set of low-quality spots
sfe <- sfe[, !colData(sfe)$discard]
```
## Normalisation of counts
### Log-transformation of counts
```{r 03_LogNorm_sfe}
## Calculate library size factors
sfe <- computeLibraryFactors(sfe)
## Have a look at the size factors
summary(sizeFactors(sfe))
## Density and histogram of library sizes
ggplot(data = data.frame(sFact = sizeFactors(sfe)),
aes(x = sFact)) +
geom_histogram(aes(y = after_stat(density)),
colour = "black",
fill = "grey",
bins = 40) +
geom_density(alpha = 0.5,
adjust = 0.5,
fill = "#A0CBE8",
colour = "#4E79A7") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
xlab("Library size") +
ylab("Density") +
theme_classic()
# calculate logcounts using library size factors
sfe <- logNormCounts(sfe)
```
## Gene-level Quality Control
### Calculating extra QC metrics
```{r 03_GeneQC_sfe1}
rowData(sfe)[["JBO019.s_logMean"]] <- rowSums(assay(sfe, "logcounts")) / rowData(sfe)[["JBO019.nLocations"]]
```
### Set and apply filters
```{r 03_GeneQC_sfe2}
is_zero <- rowData(sfe)$total == 0
is_logLow <- rowData(sfe)[["JBO019.s_logMean"]] <= 1
discard_gs <- is_zero | is_mito | is_logLow
table(discard_gs)
rowData(sfe)$discard <- discard_gs
## FEATURE SELECTION
## remove mitochondrial and other genes
sfe <- sfe[!rowData(sfe)$discard, ]
```
## Selecting genes
### Highly Variable Genes (HVGs)
```{r 03_HVGs_sfe}
## Fit mean-variance relationship
dec <- modelGeneVar(sfe,
assay.type = "logcounts")
## Visualize mean-variance relationship
fit <- metadata(dec)
fit_df <- data.frame(mean = fit$mean,
var = fit$var,
trend = fit$trend(fit$mean))
## Select top HVGs
top_hvgs <- getTopHVGs(dec,
var.field = "bio",
prop = 0.5,
var.threshold = 0,
fdr.threshold = 0.1)
fit_df <- fit_df %>%
tibble::rownames_to_column(var = "row.names") %>%
dplyr::mutate(topHVGs = ifelse(row.names %in% top_hvgs, TRUE, FALSE)) %>%
tibble::column_to_rownames("row.names")
ggplot(data = fit_df,
aes(x = mean, y = var, colour = topHVGs)) +
geom_point() +
geom_line(aes(y = trend), colour = "dodgerblue", linewidth = 1.5) +
scale_colour_manual(values = c("black", "red")) +
labs(x = "Mean of log-expression",
y = "Variance of log-expression",
colour = "Top HVGs") +
theme_classic()
```
### Code for 3.3 to 3.7
```{r shortcut_code, eval=FALSE}
## Import data
sampleDir <- "./data/spaceranger_outs/Human_Liver_Steatotic/JBO019_Results"
sampleNames <- "JBO019"
sfe <- read10xVisiumSFE(samples = sampleDir,
sample_id = sampleNames,
type = "sparse",
data = "filtered",
images = "lowres",
style = "W",
zero.policy = TRUE)
# ----------------------------------------------- #
ground_truth <- read_table("./data/to_load/spotzonationGroup.txt")
## Add QC metrics
is_mito <- grepl("(^MT-)|(^mt-)", rowData(sfe)$symbol)
sfe <- addPerLocQC(sfe, gTruth = ground_truth, assay = "counts", 2, subsets = list(mito = is_mito))
sfe <- addGeometries(sfe, samples = sampleDir, sample_id = sampleNames, res = "fullres")
sfe <- addPerGeneQC(sfe, assay = "counts", version = NULL, mirror = NULL)
# ----------------------------------------------- #
## SPOT SELECTION
## Select library size threshold
qc_lib_size <- colData(sfe)$sum < 1000
## Add threshold in colData
colData(sfe)$qc_lib_size <- qc_lib_size
## Select expressed genes threshold
qc_detected <- colData(sfe)$detected < 550
## Add threshold in colData
colData(sfe)$qc_detected <- qc_detected
## Select mitochondrial percentage threshold
qc_mito <- colData(sfe)$subsets_mito_percent > 22
## Add threshold in colData
colData(sfe)$qc_mito <- qc_mito
## Combine together the set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito
## Store the set in the object
colData(sfe)$discard <- discard
## Remove combined set of low-quality spots
sfe <- sfe[, !colData(sfe)$discard]
# ----------------------------------------------- #
## FEATURE SELECTION
## Calculate library size factors
sfe <- computeLibraryFactors(sfe)
## Calculate logcounts using library size factors
sfe <- logNormCounts(sfe)
## Calculate log-counts sample mean
rowData(sfe)[["JBO019.s_logMean"]] <- rowSums(assay(sfe, "logcounts")) / rowData(sfe)[["JBO019.nLocations"]]
## Set and apply filters
is_zero <- rowData(sfe)$total == 0
is_logLow <- rowData(sfe)[["JBO019.s_logMean"]] <= 1
discard_gs <- is_zero | is_mito | is_logLow
rowData(sfe)$discard <- discard_gs
## Remove mitochondrial and other genes
sfe <- sfe[!rowData(sfe)$discard, ]
## Fit mean-variance relationship
dec <- modelGeneVar(sfe,
assay.type = "logcounts")
## Select top HVGs
top_hvgs <- getTopHVGs(dec,
var.field = "bio",
prop = 0.5,
var.threshold = 0,
fdr.threshold = 0.05)
```
## Neighbour graph and distance matrix
### Adding spatial weights
The neighbour lists can be supplemented with spatial weights using the `nb2listw` and `nb2listwdist` functions from the `spdep` package, for a chosen type and coding scheme style. There are 6 different coding scheme styles that can be used to weight neighbour relationships:
1. **B**: is the basic binary coding (1 for neighbour, 0 for no neighbour).
2. **W**: is row standardised (sums over all links to n).
3. **C**: is globally standardised (sums over all links to n).
4. **U**: is equal to C divided by the number of neighbours (sums over all links to unity).
5. **S**: is the variance-stabilizing coding scheme (sums over all links to n).
6. **minmax**: divides the weights by the minimum of the maximum row sums and maximum column sums of the input weights; it is similar to the C and U styles.
The coding scheme style determines, in practice, the value each neighbour receives. For example, under the binary coding scheme (**B**), a spot that is a neighbour of the focal spot gets the value **1**, otherwise **0**. Under the row-standardised scheme (**W**), if the focal spot has a total of 10 neighbours, each with a weight of 1, the sum of the neighbour weights is 10 and each neighbour receives a normalised weight of 1/10 = 0.1. As a result, under row standardisation, spots with many neighbours have individually lower-weighted neighbours and are therefore not over-emphasised.
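As a quick illustration of how the coding scheme changes the weights, compare the **B** and **W** styles on the same neighbours list (toy coordinates, invented for illustration):

```r
library(spdep)

coords <- cbind(c(0, 1, 0, 1, 2), c(0, 0, 1, 1, 1))
nb <- knn2nb(knearneigh(coords, k = 2))   # each point's 2 nearest neighbours

lw_B <- nb2listw(nb, style = "B")   # binary: every neighbour weighted 1
lw_W <- nb2listw(nb, style = "W")   # row-standardised: weights sum to 1 per point

lw_B$weights[[1]]   # 1 1
lw_W$weights[[1]]   # 0.5 0.5
```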
Starting from a binary neighbours list, in which regions are either listed as neighbours or are absent (thus not in the set of neighbours for some definition), we can add a distance-based weights list. The `nb2listwdist` function supplements a neighbours list with spatial weights for the chosen type of distance modelling and coding scheme. While the offered coding schemes parallel those of the `nb2listw` function above, three distance-based types of weights are available: inverse distance weighting (IDW), double-power distance weights (DPD), and exponential distance decay (EXP). All three types of distance weight calculations are based on the pairwise distances $d_{ij}$ and are controlled by the parameter *alpha* ($\alpha$ below):
1. **idw**: $w_{ij} = d_{ij}^{-\alpha}$,
2. **exp**: $w_{ij} = \exp(-\alpha \cdot d_{ij})$,
3. **dpd**: $w_{ij} = \left[1 - (d_{ij}/d_{\max})^{\alpha}\right]^{\alpha}$,
the latter of which leads to $w_{ij} = 0$ for all $d_{ij} > d_{\max}$. Note that *IDW* weights show extreme behaviour close to 0 and can take on the value infinity. In such cases, the infinite values are replaced by the largest finite weight present in the weights list.
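These three formulas are simple enough to evaluate by hand; a base-R sketch for a single pair distance, with illustrative values $\alpha = 1$ and $d_{\max} = 5$ (chosen only for this example):

```r
d_ij  <- 2   # pairwise distance (illustrative)
alpha <- 1   # decay parameter
d_max <- 5   # maximum distance for the dpd weights

w_idw <- d_ij^(-alpha)                      # idw: 1/2  = 0.5
w_exp <- exp(-alpha * d_ij)                 # exp: exp(-2), about 0.135
w_dpd <- (1 - (d_ij / d_max)^alpha)^alpha   # dpd: (1 - 2/5)^1 = 0.6
```

Beyond `d_max` the dpd weight drops to zero, whereas idw and exp weights decay smoothly but never quite reach zero.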
### Generate distance matrices
A distance matrix is a symmetric matrix that contains the distance between each spot and every other spot. This distance can be a simple Euclidean distance based on the spot coordinates, or a weighted distance computed within a bandwidth around each spot using a kernel that scores distances between nearby spots higher than those between spots farther apart. These weighted distance matrices are later used to run geographically weighted (GW) models.
There are 6 different kernels that can be used to weight the distances between spots. The next two figures, from the `GWmodel` publication [@Gollini2015Feb], give the mathematical definition of each kernel and show graphically how each one weights by distance.
```{r GWmodelFig1, echo=FALSE, out.width = "100%", fig.align="center", fig.cap="The math equations that define the kernels."}
knitr::include_graphics("images/gwmodel_kernel_math.png")
```
```{r GWmodelFig2, echo=FALSE, out.width = "100%", fig.align="center", fig.cap="Examples from using each kernel."}
knitr::include_graphics("images/gwmodel_kernel_graphs.png")
```
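As a base-R sketch of the idea behind these kernels, a Gaussian kernel turns a plain distance matrix into a weighted one (the spot positions and the bandwidth value are invented for illustration):

```r
coords <- cbind(x = c(0, 1, 0, 1, 2), y = c(0, 0, 1, 1, 1))  # toy spot positions
d <- as.matrix(dist(coords))   # un-weighted Euclidean distance matrix
b <- 2                         # bandwidth (illustrative)

## Gaussian kernel: weight decays smoothly as distance grows relative to b
w <- exp(-0.5 * (d / b)^2)
```

Each spot weights itself 1 (`d = 0`), and neighbouring spots receive progressively smaller weights the farther away they lie.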
Below we choose one of the many possible ways of building a neighbour graph for the steatotic liver dataset. In this example we use a k-nearest-neighbours approach with row-standardised distance-based weights.
```{r 03_spatial_weights_to_sfe}
## add a neighbour graph using a weighted distance matrix
sfe <- addSpatialNeighGraphs(sfe, "JBO019", type = "knearneigh", style = "W", distMod = "raw", k = 6)
colGraphs(sfe)
## Calculate a simple distance matrix
sfe <- addDistMat(sfe, p = 2)
```
We can use a `geom` from the `tidyterra` package (commonly used for map visualisations) to plot the neighbour graph we generated in the previous step.
```{r 03_visualise_neighbours}
## Retrieve the tissue image
sfei <- getImg(sfe, image_id = "lowres")
## Extract the spot locations
spot_coords <- spatialCoords(sfe) %>% as.data.frame()
## Set limits
xlim <- c(min(spot_coords$pxl_col_in_fullres) - 100,
max(spot_coords$pxl_col_in_fullres) + 100)
ylim <- c(min(spot_coords$pxl_row_in_fullres) - 100,
max(spot_coords$pxl_row_in_fullres) + 100)
nbs <- colGraph(sfe)
ggplot() +
geom_spatraster_rgb(data = imgRaster(sfei)) +
geom_sf(data = as(nb2lines(nbs$neighbours, coords = spatialCoords(sfe)), "sf")) +
lims(x = xlim, y = ylim) +
coord_sf() +
theme_void()
```
Now that we have a fully QC-ed dataset with spatial weights and a neighbour graph attached, the data are fully prepared for the application of geospatial methods - specifically, in practical 4, geographically weighted principal components analysis (GWPCA).
## Putting it all together
The code below puts all of these steps in order, selecting one of the available options at each step.
```{r all_in_one, message=FALSE, warning=FALSE, eval=FALSE}
## Import data
sampleDir <- "./data/spaceranger_outs/Human_Liver_Steatotic/JBO019_Results"
sampleNames <- "JBO019"
sfe <- read10xVisiumSFE(samples = sampleDir,
sample_id = sampleNames,
type = "sparse",
data = "filtered",
images = "lowres",
style = "W",
zero.policy = TRUE)
# ----------------------------------------------- #
ground_truth <- read_table("./data/to_load/spotzonationGroup.txt")
## Add QC metrics
is_mito <- grepl("(^MT-)|(^mt-)", rowData(sfe)$symbol)
sfe <- addPerLocQC(sfe, gTruth = ground_truth, assay = "counts", 2, subsets = list(mito = is_mito))
sfe <- addGeometries(sfe, samples = sampleDir, sample_id = sampleNames, res = "fullres")
sfe <- addPerGeneQC(sfe, assay = "counts", version = NULL, mirror = NULL)
# ----------------------------------------------- #
## SPOT SELECTION
## Select library size threshold
qc_lib_size <- colData(sfe)$sum < 1000
## Add threshold in colData
colData(sfe)$qc_lib_size <- qc_lib_size
## Select expressed genes threshold
qc_detected <- colData(sfe)$detected < 550
## Add threshold in colData
colData(sfe)$qc_detected <- qc_detected
## Select mitochondrial percentage threshold
qc_mito <- colData(sfe)$subsets_mito_percent > 22
## Add threshold in colData
colData(sfe)$qc_mito <- qc_mito
## Combine together the set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito
## Store the set in the object
colData(sfe)$discard <- discard
## Remove combined set of low-quality spots
sfe <- sfe[, !colData(sfe)$discard]
# ----------------------------------------------- #
## FEATURE SELECTION
## Calculate library size factors
sfe <- computeLibraryFactors(sfe)
## Calculate logcounts using library size factors
sfe <- logNormCounts(sfe)
## Calculate log-counts sample mean
rowData(sfe)[["JBO019.s_logMean"]] <- rowSums(assay(sfe, "logcounts")) / rowData(sfe)[["JBO019.nLocations"]]
## Set and apply filters
is_zero <- rowData(sfe)$total == 0
is_logLow <- rowData(sfe)[["JBO019.s_logMean"]] <= 1
discard_gs <- is_zero | is_mito | is_logLow
rowData(sfe)$discard <- discard_gs
## Remove mitochondrial and other genes
sfe <- sfe[!rowData(sfe)$discard, ]
## Fit mean-variance relationship
dec <- modelGeneVar(sfe,
assay.type = "logcounts")
## Select top HVGs
top_hvgs <- getTopHVGs(dec,
var.field = "bio",
prop = 0.5,
var.threshold = 0,
fdr.threshold = 0.05)
# ----------------------------------------------- #
## ADD GEOGRAPHY
## Add a neighbour graph using a weighted distance matrix
sfe <- addSpatialNeighGraphs(sfe, "JBO019", type = "knearneigh", style = "W", distMod = "raw", k = 6)
## Calculate a simple distance matrix
sfe <- addDistMat(sfe, p = 2)
```