---
title: "Ancestry in GWAS"
description: |
  How does ancestry influence GWAS analysis, and why is it problematic?
---
# Ancestry {#ancestry}
## Intro
Another cool aspect of genetics is ancestry. As generations pass, each
individual carries a genetic ancestral composition that is similar to,
but still different from, that of their family members (like hybrids).
Some genetics companies (e.g. 23andMe) sell kits for customers to learn
more about their genetics, both health-wise and ancestry-wise.
So how does this area apply to statistical genetics? Ancestry is closely
connected to GWAS. In addition, scientists can apply mathematical and
machine learning methods to infer one's ancestry; in other words, we can
estimate a person's origins from their genetic data!
## Ancestry can be a confounder in GWAS
GWAS analyzes the statistical association between SNPs and our trait of
interest. In observational studies, ancestry can be a confounder of
GWAS. A confounder is a variable that is associated with both the
predictor and the response variable.
![Above is a simple DAG](file/dag.png)
People with different ancestries can have very different genotype
distributions: for example, at a given SNP, people with European
ancestry may have different allele frequencies than people with Asian
ancestry. Ancestry may also be associated with the trait. Therefore, if
we do not stratify by or control for this confounding variable, we may
end up with a spurious association in GWAS. This concept is especially
important to keep in mind when carrying out *observational studies*.
After adjusting for ancestry, our GWAS model (marginal regression) would
look like:
$$
E[y|x_j,A] = \beta_0 +\beta_j x_j+\gamma A
$$
where:
- $y$ is the trait
- $x_j$ is the number of minor alleles at position $j$
- $A$ is the genetic ancestry
- the model is fit separately for each position $j = 1, \dots, p$
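
To see why the adjustment matters, here is a minimal simulation sketch in base R. Everything in it is made up for illustration: ancestry is simplified to a binary label `A`, the allele frequencies (0.1 vs. 0.5) and the effect of ancestry on the trait are arbitrary, and the SNP has no true effect on the trait.

```r
set.seed(1)
n <- 2000
# Hypothetical binary ancestry label (an assumption; real ancestry is not binary)
A <- rbinom(n, size = 1, prob = 0.5)
# Minor allele frequency differs by ancestry, so A is associated with the SNP
maf <- ifelse(A == 1, 0.5, 0.1)
x <- rbinom(n, size = 2, prob = maf)   # number of minor alleles: 0, 1 or 2
# The trait depends on ancestry only; the SNP itself has no effect
y <- 2 * A + rnorm(n)

unadjusted <- lm(y ~ x)       # E[y | x]    = beta_0 + beta_j x
adjusted   <- lm(y ~ x + A)   # E[y | x, A] = beta_0 + beta_j x + gamma A

beta_unadj <- unname(coef(unadjusted)["x"])
beta_adj   <- unname(coef(adjusted)["x"])
print(c(unadjusted = beta_unadj, adjusted = beta_adj))
```

In this setup the unadjusted slope for `x` should be clearly positive even though the SNP does nothing, while the adjusted slope should be close to zero.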
However, what should we do when we do not have ancestry information for
a particular data set? Then we have to infer ancestry from the genotypes
themselves, and this is where PCA becomes powerful.
## An essential approach to inferring ancestry: Principal Component Analysis
What do we do when we do not know one's ancestral composition? Luckily,
after analyzing the ancestry of many individuals and collecting the
results into a huge database, statisticians can use that previous
knowledge to estimate an individual's ancestry even when it is unknown.
How do we achieve this? One of the essential and most popular methods
for such problems is Principal Component Analysis, or PCA for short.
PCA is a machine learning technique: the components learned from
previously collected data can also be used to describe new data.
Here are some key points about PCA:
1. It is a technique to reduce high dimensional data to a few
dimensions by "summarizing the variables", achieved by expressing
**principal components (PCs)** as linear combinations of the
predictors.
2. The criterion for choosing each PC is to pick the linear combination
that maximizes the variation of the resulting values. The technique
is called "dimension reduction" because a few PCs can stand in for
many original variables.
3. It is called an *unsupervised* algorithm, meaning there is no
response variable we are interested in; instead, we want PCA to show
us what the distribution of the data looks like, for example whether
it forms clusters.
Here is a YouTube [video](https://www.youtube.com/watch?v=x4_XTBoxr3Q)
that explains PCA very well (adapted from Prof. Leslie Myint).
> For a better conceptual understanding of PCA, here is a link to a
> [discussion
> forum](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579)
> that explains PCA concepts by analogy with the characteristics of wines!
### What is beneficial and special about PCA?
Recall from the last section that ancestry may be a confounder in GWAS.
Since the PCs summarize ancestry information, we can include them in our
analysis simply by adding linear terms to the original marginal
regression model (but remember, this model is for just ONE SNP!).
$$
E[y|x_j,PC_1,PC_2,\dots]=\beta_0+\beta_1x_j+\beta_2PC_1+\beta_3 PC_2+\dots
$$
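
As a rough sketch of how this model could be fit for every SNP in turn, the base R example below simulates genotypes for two hypothetical populations, computes PCs with `prcomp()`, and compares marginal regressions with and without PC1 and PC2 as covariates. The sample size, number of SNPs, allele frequencies, and the trait model are all assumptions chosen for illustration; no SNP has a true effect on the trait.

```r
set.seed(2)
n <- 400; p <- 50
pop <- rep(c(0, 1), each = n / 2)      # hidden population label (never given to PCA)
f0 <- runif(p, 0.1, 0.9)               # allele frequencies in population 0
f1 <- runif(p, 0.1, 0.9)               # allele frequencies in population 1
G <- sapply(seq_len(p), function(j) rbinom(n, 2, ifelse(pop == 1, f1[j], f0[j])))
y <- 1.5 * pop + rnorm(n)              # trait depends on ancestry only, not on any SNP

pcs <- prcomp(G, center = TRUE, scale. = FALSE)$x[, 1:2]
PC1 <- pcs[, 1]; PC2 <- pcs[, 2]

# p-value of the SNP term in a fitted marginal regression
snp_pvalue <- function(fit) summary(fit)$coefficients["xj", "Pr(>|t|)"]

pvals_unadj <- apply(G, 2, function(xj) snp_pvalue(lm(y ~ xj)))
pvals_adj   <- apply(G, 2, function(xj) snp_pvalue(lm(y ~ xj + PC1 + PC2)))

# Share of SNPs flagged at the 0.05 level, without and with PC adjustment
c(unadjusted = mean(pvals_unadj < 0.05), adjusted = mean(pvals_adj < 0.05))
```

Because every association here is spurious, the share of flagged SNPs should drop substantially once the PCs are included.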
With PCA, we can efficiently replace the predictors we have with new
variables (PCs). In other words, we can create a smaller number of new
variables that capture most of the variation in our samples. Since the
PCs are linear combinations of the old predictors, they look like:
$$
\begin{aligned}
PC_1 &= a_{11}x_1+a_{12}x_2+\dots+a_{1p}x_p\\
PC_2 &= a_{21}x_1+a_{22}x_2+\dots+a_{2p}x_p\\
&\;\;\vdots\\
PC_p &= a_{p1}x_1+a_{p2}x_2+\dots+a_{pp}x_p
\end{aligned}
$$
*Note*: Among all possible linear combinations of the predictors (with
normalized coefficients), $PC_1$ ALWAYS explains the most variability in
our data; $PC_2$ ALWAYS explains the second most while being
uncorrelated with $PC_1$; and so on for $PC_3$. The variance explained
decreases with each further PC.
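
The linear-combination view can be checked by hand. The sketch below uses a made-up toy data set with three predictors (not genotype data) and base R's `prcomp()`: it rebuilds the PC1 scores from the loadings and confirms that the variance of the PCs decreases.

```r
set.seed(3)
# Hypothetical toy data: 100 observations of 3 predictors, two of them correlated
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)
x3 <- rnorm(100)
X  <- cbind(x1, x2, x3)

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# The PC1 column of the rotation matrix holds the coefficients a_11, a_12, a_13
a1 <- pca$rotation[, "PC1"]

# PC1 = a_11 * x1 + a_12 * x2 + a_13 * x3, computed on the centered predictors
Xc <- scale(X, center = TRUE, scale = FALSE)
pc1_by_hand <- as.vector(Xc %*% a1)

# Compare the hand-computed scores with the scores stored by prcomp()
same_scores <- all.equal(pc1_by_hand, unname(pca$x[, "PC1"]))
print(same_scores)

# Variance of each PC, in the order PC1, PC2, PC3
var_by_pc <- apply(pca$x, 2, var)
print(var_by_pc)
```

Note that `prcomp()` centers the predictors before forming the combinations, which is why the hand computation uses `Xc` rather than `X`.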
The amazing thing about PCA here is that plotting the data along the top
PCs often reveals groups of similar individuals, allowing us to see
patterns very easily. In this context, the "patterns" mean **genetic
ancestry**!
As an example visualization, here is a figure from the paper "Genes
mirror geography within Europe." ([Novembre et al.,
2008.](https://www.nature.com/articles/nature07331))
![Example of how PC1 and PC2 reflect European genetic
ancestry](file/pc.jpg)
> The cited paper characterized 3000 European individuals genotyped at
> many DNA sites in order to find patterns in their population
> structure. The letters in the figure record the individuals' reported
> geographic information.
#### PCA Vocab
Now that you know what PCA is, here are some terms you should know as
well!
- **Score**: The value an observation takes on a given PC.
- **Loadings**: The coefficients $a_{11}, a_{12}, \dots, a_{pp}$. They
can be either positive or negative. The loadings tell you the
contribution of each original variable to the PC: the greater the
absolute value of a loading, the more weight the PC places on that
variable.
- **Variance explained**: The share of the total variability in the data
that a PC captures. This decreases with subsequent PCs.
That's it!
## Lab: Applying PCA
You've learned a lot about our topic today! In this section, we will
take a look at what PCA results look like and explore what they mean.
### Understand PCA output
Here is a screenshot from my PCA analysis demo. I have 15 SNPs (SNP
1-15). Can you see how many PCs we have? What do the numbers in each
column mean?
![PCA output form](file/pca_out.png)
Summary:
1. There are 15 PCs for our data set.
2. The "standard deviation" column describes the variability each PC
captures (more specifically, the standard deviation is the square
root of the variance of that PC). PC1 has the greatest standard
deviation, as we expect, and the values decrease with each subsequent PC.
3. The "rotation" columns are the loadings. For example, in PC1, the
most important variables are SNP 1 and SNP 2 because they have the
largest absolute loadings (since PC1 captures the most variability,
this indicates that SNP 1 and SNP 2 contribute to it very much!).
This picture only shows the loadings for PC1 to PC6.
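
The "standard deviation" and "rotation" labels resemble what base R's `prcomp()` prints, so, assuming that is how the screenshot was produced, the following sketch shows how these quantities could be read off programmatically. The genotype data here are simulated and hypothetical (they are not the demo data): 15 SNPs, of which only SNP1 and SNP2 are given different allele frequencies in two populations.

```r
set.seed(4)
n <- 200
pop <- rep(c(0, 1), each = n / 2)
# 15 hypothetical SNPs; only SNP1 and SNP2 differ strongly between populations
freq <- function(j) if (j <= 2) ifelse(pop == 1, 0.9, 0.1) else rep(0.3, n)
geno <- sapply(1:15, function(j) rbinom(n, 2, freq(j)))
colnames(geno) <- paste0("SNP", 1:15)

pca <- prcomp(geno, center = TRUE, scale. = FALSE)

n_pcs <- ncol(pca$rotation)                   # number of PCs
print(n_pcs)
print(round(pca$sdev, 2))                     # standard deviation of each PC

# Proportion of variance explained = squared sdev over the total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:3], 2))

# SNPs with the largest absolute loadings on PC1
top_snps <- names(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE))[1:2]
print(top_snps)
```

Calling `summary(pca)` prints the standard deviations together with the proportion and cumulative proportion of variance in one table.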
### PC score visualization
Let's create a plot of the scores for the top two PCs, PC1 and PC2, to
visualize patterns in the data. What do you learn from this plot?
![PC1 scores vs. PC2 scores](file/scree.png)
The magic of PCA is that even if we have no knowledge about ancestry,
the PCs automatically group the individuals. It seems like we have two
population groups, separated mainly by PC1 (that is where the scores
differ the most between the two groups).
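
A plot of this kind could be produced along the following lines. This is only a sketch on simulated data: the two populations, their allele frequencies, and the colors are assumptions for illustration, and the population labels are used only to color the points, never by PCA itself.

```r
set.seed(5)
n <- 200; p <- 100
pop <- rep(c("pop1", "pop2"), each = n / 2)   # labels used only for coloring
f1 <- runif(p, 0.1, 0.9)                      # allele frequencies in pop1
# pop2 frequencies: pop1 frequencies plus noise, kept inside (0, 1)
f2 <- pmin(pmax(f1 + rnorm(p, sd = 0.15), 0.05), 0.95)
G <- sapply(seq_len(p), function(j) rbinom(n, 2, ifelse(pop == "pop1", f1[j], f2[j])))

# PCA sees only the genotype matrix, not the labels
scores <- prcomp(G, center = TRUE, scale. = FALSE)$x

plot(scores[, "PC1"], scores[, "PC2"],
     col = ifelse(pop == "pop1", "steelblue", "tomato"), pch = 19,
     xlab = "PC1 score", ylab = "PC2 score")
legend("topright", legend = c("pop1", "pop2"),
       col = c("steelblue", "tomato"), pch = 19)

# Average PC1 score per population: the two groups sit on opposite sides
pc1_means <- tapply(scores[, "PC1"], pop, mean)
print(pc1_means)
```

The sign of a PC is arbitrary, so which population ends up on the positive side of PC1 carries no meaning.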
However, powerful as this algorithm is, PCA still has its limitations.
Since an individual's genotype depends on their ancestry, people with
mixed ancestry inherit genetic material from parents (or grandparents)
of different backgrounds, so at some SNPs their genotypes may not match
those of any single group. If we carry out PCA on such a group of
people, we need to interpret the results more carefully.
The topic of "alleles' narrative potential" is also important, or
problematic, in GWAS and ancestry studies. For example, in biology,
there are situations where one allele functionally masks the other,
resulting in different phenotypic traits (*complete dominance*). There
are also concepts such as *multiple alleles, incomplete dominance, and
codominance*. Therefore, when we carry out GWAS and adjust for
ancestry, the same genotype may not always be interpreted in the same
way, which makes it challenging for researchers to understand the
results fully and correctly.
## Outro
We briefly explored a widely used method for inferring ancestry: PCA.
We've seen how ancestry can be a potential confounder in our GWAS
analysis, and that the way to address this problem is to include
ancestry in the model by adding PC terms. In real-life studies,
researchers also use this technique to group observations, reduce the
number of variables, and effectively learn from a huge data set.