This was designed as a tutorial for Ian Misner's R course in order to demonstrate how to plot whole genome information, such as Fst, Dxy or any other statistic, in R.
The sample files here are downsampled and they are intended for tutorial use only. Additionally, R is extremely powerful and there are many ways to accomplish the same task. I dappled around trying to make Manhattan plots for awhile and this was my solution to it. I suspect there are much better ways to accomplish it, but this is the pipeline I use. It appears that since I made my solution, there has been a package released called qqman that does something similar.
You may have a BED file, IGV (Integrative Genomics Viewer: https://www.broadinstitute.org/igv/) file or a different type of file. This tutorial is designed to work with IGV files, but it can be modified by tweaking the provide perl scripts to work on BED files or other files formats. Below is the outline of the pipeline I used to get from raw reads into the IGV file format:
First, download the Github file containing all of the files you'll need for this tutorial by click on the "Download ZIP" icon on the right side. Using the terminal window, you'll need to change directories to this folder using this command:
cd ~/Downloads/Manhattan_plots-master/
Inside you should find a file titled: O_niloticus_males_vs_females.downsampled.fst.igv
This is a downsampled file of what was used in Gammerdinger et al., 2014.
- The file should have 5 columns and the first line is the header for the file:
- Column 1 is the chromosome/scaffold/linkage group
- Column 2 is the start position of the feature
- Column 3 is the end position of the feature
- Column 4 is the feature, in this case it is a Single Nucleotide Polymorphism ("snp")
- Column 5 is the score of the feature, in this case it is Fst
Chromosome Start End Feature 1:2
LG1 697 698 snp 0.02473565
LG1 821 822 snp 0.01328195
LG1 1098 1099 snp 0.02965631
LG1 1432 1433 snp 0.01148924
LG1 3842 3843 snp 0.05440369
LG1 8114 8115 snp 0.05959184
LG1 9326 9327 snp 0.03735688
LG1 11606 11607 snp 0.05065753
LG1 14177 14178 snp 0.02231113
...
UNK5647 331 332 snp 0.00877602
UNK5647 607 608 snp 0.03875315
UNK5647 773 774 snp 0.02473372
UNK5647 819 820 snp 0.03518803
UNK5647 902 903 snp 0.01615389
UNK5651 779 780 snp 0.02416214
UNK5651 842 843 snp 0.01711530
UNK5655 248 249 snp 0.04347897
UNK5655 431 432 snp 0.02690219
UNK5655 774 775 snp 0.15040816
Instinctively, you may be tempted to do two things:
-
First, you may try to make a graph for each chromosome/scaffold/linkage group. This is what I did at first, but you realize that since each chromosome/scaffold/linkage group is a different size, that you'd need to scale each chromosome/scaffold/linkage group to be proportional to fairly portray the genome. For example, if we allocate 2 inches for each chromosome, then a 5 megabase chromosome will occupy the same amount of space as a 35 megabase chromosome. The result from this will be that the 35 megabase chromsome will artificially appear to have a denser clustering of SNPs. The solution, as I said above, would be to make plots sized proportionally to the size of the chromosome, so that a plot with 5 megabase chromosome takes up a seventh the space of a 35 megabase chromosome. I suspect that a solution to this criticism exists, but I am unaware of it.
Furthermore, many genome assemblies, particularly for non-model organsims, are thousands of scaffolds. Perhaps this scaling approach could be accomplished manually for a limited number of chromosomes, but it isn't feasible for assemblies that are more fragmented and contain many scaffolds.
-
Second, you may try to put all of the information in one plot. You'll use the start position as the x-axis and the Fst value as the y-axis. This is basically where we are going, but remember you have information about the genome position in column 1 and column 2. The problem is that this idea makes no distinction between position 1 on chromosome 1 and position 1 on chromosome 2. The result of this is shown below. Note the overlapping colors, this is the result of many chromosomes laid on top of each other.
In order to accomplish this you will need to get a few files downloaded and created. First, let's get the tilapia reference genome, which is the reference genome we are working with, using this command:
curl -O -L http://cichlid.umd.edu/download/20120125_MapAssembly.anchored.assembly.fasta.underscores
Next, you'll want to make a file containing the sizes of each chromosome/scaffold/linkage group. In order to do this, we need to use a useful program called Samtools. If you don't already have Samtools, then we need to download and install it on your computer. You'll want to use these commands:
curl -O -L http://sourceforge.net/projects/samtools/files/samtools/1.2/samtools-1.2.tar.bz2
tar xvfj samtools-1.2.tar.bz2
cd samtools-1.2/
make
export PATH=$PATH:~/Downloads/Manhattan_plots-master/samtools-1.2
samtools faidx ~/Downloads/Manhattan_plots-master/20120125_MapAssembly.anchored.assembly.fasta.underscores
cd ..
Now, we have a file containing each chromosome/scaffold/linkage group and the size of the chromosome/scaffold/linkage group as below:
LG1 31194787 5 60 61
LG2 25048291 31714711 60 61
LG3 19325363 57180479 60 61
LG4 28679955 76827937 60 61
LG5 37389089 105985897 60 61
LG6 36725243 143998143 60 61
LG7 51042256 181335479 60 61
LG8_24 29447820 233228448 60 61
LG9 20956653 263167070 60 61
LG10 17092887 284473007 60 61
...
UNK5646 1002 943183303 60 61
UNK5647 1001 943184331 60 61
UNK5648 1001 943185358 60 61
UNK5649 1001 943186385 60 61
UNK5650 1001 943187412 60 61
UNK5651 1000 943188439 60 61
UNK5652 1000 943189465 60 61
UNK5653 1000 943190491 60 61
UNK5654 1000 943191517 60 61
UNK5655 1000 943192543 60 61
What we want now is a file that alters these sizes into a "running total" of the genome. This way postion 1 of LG1 is 1, position 1 of LG2 is 31194788, position 1 of LG3 is 56243079 and etc. To accomplish this goal, we will run the script named Running_chrom.pl using this command-line:
perl Running_chrom.pl --input_file=20120125_MapAssembly.anchored.assembly.fasta.underscores.fai --output_file=O_niloticus_running_chrom_size.txt
The output should look like:
LG1 0
LG2 31194787
LG3 56243078
LG4 75568441
LG5 104248396
LG6 141637485
LG7 178362728
LG8_24 229404984
LG9 258852804
LG10 279809457
...
UNK5646 927669481
UNK5647 927670483
UNK5648 927671484
UNK5649 927672485
UNK5650 927673486
UNK5651 927674487
UNK5652 927675487
UNK5653 927676487
UNK5654 927677487
UNK5655 927678487
Next, you need to run the perl script, Genome_R_script.pl, which will convert the position column into a "running position" column using this command-line:
perl Genome_R_script.pl --input_file=O_niloticus_males_vs_females.downsampled.fst.igv --output_file=O_niloticus_males_vs_females.downsampled.fst.R_ready --chrom_size_file=O_niloticus_running_chrom_size.txt
The output should look like:
LG1 697 0.02473565
LG1 821 0.01328195
LG1 1098 0.02965631
LG1 1432 0.01148924
LG1 3842 0.05440369
LG1 8114 0.05959184
LG1 9326 0.03735688
LG1 11606 0.05065753
LG1 14177 0.02231113
LG1 15117 0.01535743
...
UNK5647 927670814 0.00877602
UNK5647 927671090 0.03875315
UNK5647 927671256 0.02473372
UNK5647 927671302 0.03518803
UNK5647 927671385 0.01615389
UNK5651 927675266 0.02416214
UNK5651 927675329 0.01711530
UNK5655 927678735 0.04347897
UNK5655 927678918 0.02690219
UNK5655 927679261 0.15040816
And you'll need to run the previous perl script again for the Fisher's Exact Test file
perl Genome_R_script.pl --input_file=O_niloticus_males_vs_females.downsampled.fet.igv --output_file=O_niloticus_males_vs_females.downsampled.fet.R_ready --chrom_size_file=O_niloticus_running_chrom_size.txt
Now we are ready to make our Manhattan plots in R. First, we need to open R and set our working directory.
setwd("~/Downloads/Manhattan_plots-master/")
Next, you need to read in your files using these commands
WG_Fst.dat<-read.table("~/Downloads/Manhattan_plots-master/O_niloticus_males_vs_females.downsampled.fst.R_ready", header=F)
WG_Fet.dat<-read.table("~/Downloads/Manhattan_plots-master/O_niloticus_males_vs_females.downsampled.fet.R_ready", header=F)
Lastly, we get to make the png containing two plots using this command:
# Makes a png file named O_niloticus_WG.png in my working directory with a size of 13inx7.5in and a resolution of 300dpi
png("O_niloticus_WG.png",width=13,height=7.5,units="in",res=300)
# Makes two panels stack on top of each other and defines the borders and margins
par(mfrow=c(2,1), oma=c(0.25,0.25,3,0.25), mar=c(5,4,1.5,1))
# Makes the Fst plot without axes in the top panel. This is super bare bones. We want to customize it all ourselves.
plot(WG_Fst.dat$V2,WG_Fst.dat$V3, col=WG_Fst.dat$V1, xlab=NA, ylab=NA, pch=20, cex=0.01, ylim=c(0,1), axes=F)
# Adds a box around the plot in the top panel
box()
# Adds y-axis values to the top panel
axis(2, cex=0.85, las=2)
# Adds the Linkage Group numbers to the x-axis in the top panel
mtext(side=1, " 1 2 3 4 5 6 7 8_24 9 10 11 12 13 14 15 16_21 17 18 19 20 22 23 ", cex=0.75, las=0, line=0.2)
# Adds the line to emcompass the UNK scaffolds for the top panel
mtext(side=1, " _______________________________________________________", cex=0.65, las=0, line=-0.5)
# Adds the text beneath the UNK Scaffolds line to the top panel
mtext(side=1, " UNK1-UNK5655", cex=0.75, las=0, line=0.45)
# Adds an x-axis label to the top panel
mtext(side=1, "Linkage Group", line=2, cex=0.9)
# Adds a y-axis label to the top panel. The square brackets around ST makes subscript.
mtext(side=2, expression("F"[ST]), line=2.5, cex=0.9)
# Adds a title to the top panel
mtext(side=3, expression("Popoolation2's F"[ST]), line=0.2)
# Makes the Fisher's Exact Test plot without axes in the bottom panel.
plot(WG_Fet.dat$V2,WG_Fet.dat$V3, col=WG_Fet.dat$V1, xlab=NA, ylab=NA, pch=20, cex=0.01, ylim=c(0,20), axes=F)
# Adds a box around the plot for the bottom panel
box()
# Adds y-axis values for the bottom panel
axis(2, cex=0.85, las=2)
# Adds the Linkage Group numbers to the x-axis for the bottom panel
mtext(side=1, " 1 2 3 4 5 6 7 8_24 9 10 11 12 13 14 15 16_21 17 18 19 20 22 23 ", cex=0.75, las=0, line=0.2)
# Adds the line to emcompass the UNK scaffolds for the bottom panel
mtext(side=1, " _______________________________________________________", cex=0.65, las=0, line=-0.5)
# Adds the text beneath the UNK Scaffolds line to the bottom panel
mtext(side=1, " UNK1-UNK5655", cex=0.75, las=0, line=0.45)
# Adds an x-axis label to the bottom panel
mtext(side=1, "Linkage Group", line=2, cex=0.9)
# Adds a y-axis label to the bottom panel
mtext(side=2, "-log(p)", line=2.5, cex=0.9)
# Adds a title to the bottom panel
mtext(side=3, expression("Popoolation2's Fisher's Exact Test"), line=0.2)
# Adds lime green dotted line at 1.301 to the bottom panel
abline(h=1.301, lwd=2, col="chartreuse", lty=2)
# Adds red dotted line at 3 to the bottom panel
abline(h=3, lwd=2, col="red", lty=2)
# Closes the png
dev.off()
The final product should look like:
One trick I do here (And I am not entirely sure why it works, because stumbled across it) is I assign the color of each plot ("col" option) to be equal to the chromosome/linkage group/scaffold column. Each time a new chromosome/linkage group/scaffold arises in the plot making process it changes the color.
We can now see that LG1 is interesting to us, so we might want to make a plot of just linkage group 1. This is really easily accomplished and just requires us to modify a few things. First, we will need to make a LG1 subset data table out of our whole genome data table using the the following commands for Fst and Fisher's Exact Test:
LG1_Fst.dat<-subset(WG_Fst.dat, V2 < 31194787,select = c(V1:V3))
LG1_Fet.dat<-subset(WG_Fet.dat, V2 < 31194787,select = c(V1:V3))
Then, we just need to plot them. For the most part the code is similar. There are a few tweaks though. Now we don't worry about making a bunch of text on the bottom of each plot for the linkage groups. Instead, we just have to turn the x-axis on. Another tweak is now I have divide the x-axis by 1,000,000 in order to have the graph scale by megabase. It gives a much cleaner overall look and it is much easier to interpret. The code for the individal chromosome/linkage group/scaffold is below:
# Makes a png file named O_niloticus_LG1.png in my working directory with a size of 13inx7.5in and a resolution of 300dpi
png("O_niloticus_LG1.png",width=13,height=7.5,units="in",res=300)
# Makes two panels stack on top of each other and defines the borders and margins
par(mfrow=c(2,1), oma=c(0.25,0.25,3,0.25), mar=c(5,4,1.5,1))
# Makes the Fst plot without axes in the top panel. Once again, this is super bare bones. We want to customize it all ourselves.
plot(LG1_Fst.dat$V2/1000000,LG1_Fst.dat$V3, col="blue", xlab=NA, ylab=NA, pch=20, cex=0.05, ylim=c(0,1), axes=F)
# Adds a box around the plot in the top panel
box()
# Adds x-axis values for the top panel
axis(1, cex=0.85, las=0)
# Adds y-axis values for the top panel
axis(2, cex=0.85, las=2)
# Adds an x-axis label to the top panel
mtext(side=1, "Position (Mb)", line=2.5, cex=0.9)
# Adds an y-axis label to the top panel
mtext(side=2, expression("F"[ST]), line=2.5, cex=0.9)
# Adds a title to the top panel
mtext(side=3, expression("Popoolation2's F"[ST]), line=0.2)
# Makes the Fisher's Exact Test plot without axes in the bottom panel.
plot(LG1_Fet.dat$V2/1000000,LG1_Fet.dat$V3, col="blue", xlab=NA, ylab=NA, pch=20, cex=0.05, axes=F)
# Adds a box around the plot in the bottom panel
box()
# Adds x-axis values for the bottom panel
axis(1, cex=0.85, las=0)
# Adds y-axis values for the bottom panel
axis(2, cex=0.85, las=2)
# Adds an x-axis label to the bottom panel
mtext(side=1, "Position (Mb)", line=2.5, cex=0.9)
# Adds an y-axis label to the bottom panel
mtext(side=2, "-log(p)", line=2.5, cex=0.9)
# Adds a title to the bottom panel
mtext(side=3, expression("Popoolation2's Fisher's Exact Test"), line=0.2)
# Adds lime green dotted line at 1.301 (p=0.05) to the bottom panel
abline(h=1.301, lwd=2, col="chartreuse", lty=2)
# Adds red dotted line at 3 (p=0.001) to the bottom panel
abline(h=3, lwd=2, col="red", lty=2)
# Closes the png
dev.off()
The final product should appear to look like this:
Congratulations!!! You have now created publication quality plots in R using a next-generation sequencing data set!!!