-
Notifications
You must be signed in to change notification settings - Fork 9
Separate sub genomes of an allopolyploid
Polyploid chromosome-resolved genome assemblies have (finally!) become a reality in biology research. A polyploid genome contains two (or more) 'subgenomes', each comprising a set of chromosomes. It's crucial to distinguish between two types of polyploids: autopolyploids, which arise from a single species doubling its chromosome number, and allopolyploids, which emerge from the hybridization of two distinct species. This tutorial will focus exclusively on allopolyploids.
We can represent the evolutionary history of an allopolyploid lineage as follows:
Note the period of divergence between the allopolyploid lineages (ancestral lineages A and B) and consider the accumulation of transposable elements (TEs; consider this before reading the spoilers just below ;) ).
Spoilers: During the initial phase, TE accumulation will be similar for A and B. During the second phase, distinct TEs will accumulate on the genomes of lineages A and B, as they are separate lineages. During the third phase, both subgenomes will possess common TEs once more.
In this tutorial, we will retrieve the subgenomes from A and B by identifying "fossil transposable elements" – ancestral transposable elements that accumulated separately in subgenomes (ancestral lineages) A and B. We will do so by fishing out k-mers with two properties:
- Highly represented (some TEs tend to have repeats that are highly represented when decomposed in k-mers)
- Distinct between homeolog pairs (i.e. groups - pairs, trios - of chromosomes in the same species that originated by speciation and were brought back together in the same genome by allopolyploidization
Before embarking on this analysis, you need to determine the homeolog pairings. For instance, if you have six chromosomes and a allopolyploid with two subgenomes, you need to identify the three pairs. This can be accomplished using UCEs (ultra-conserved elements), COS (conserved ortholog sequences), or synteny/circos. Consult Cerca et al for detailed instructions and literature.
# I like to label folders numerically because it's easy to follow, and once you finish a whole set of analyses and come back to it a few months later, you're able to see the order very clearly.
mkdir 00_data 01_separating_genome 02_jellyfish_counts 03_jellyfish_dumps
2. Second, we download the publicly available Scalesia atractyloides genome published by Cerca et al 2022.
We move in into the folder data, and download the genome.
cd 00_data
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/947/069/175/GCA_947069175.1_scalesia_atractyloides_v1/GCA_947069175.1_scalesia_atractyloides_v1_genomic.fna.gz
In the interest of having clean and easy-to-read names, let's rename it, and unzip it.
mv GCA_947069175.1_scalesia_atractyloides_v1_genomic.fna.gz scalesia.fna.gz
gunzip scalesia.fna.gz*
From a single genome fasta file, we want a file for each of the chromosomes. If you notice, the fasta has 1,634 sequences (grep -c ">" scalesia.fna), but only 34 chromosomes (grep -c "chr" scalesia.fna). This means that the genome has 34 chromosomes and 1,600 'loose' scaffolds. For our approach, we only want the chromosomes.
NCBI fasta files are 'interleaved' - simply put, the sequence comes in multiple sequences. It's often better to operate with fastas that have a simple and repeated two-line structure (i.e. the first line with the header, starting with ">", and the second with the sequence.
awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' scalesia.fna > scalesia.deinterleaved.fna
Now, we obtain a index of the chromosomes, and we modify the names.
grep "chr" scalesia.deinterleaved.fna | \
sed "s/.*chr/chr/; s/,.*/_/" > ../01_separating_genome/chromosome.ids.tsv
I will also replace the "," by "_" from the headers so we operate with easier characters.
sed -i "s/,/_/g" scalesia.deinterleaved.fna
Look into the file.
head ../01_separating_genome/chromosome.ids.tsv
It should look like this: chr_20_ chr_28_ chr_14_ ...
Notice, I added an underscore in the end so it is easier to pull out the chromosomes. If we do: grep "chr1", this will fetch: chr1, chr10, chr11, ... , chr19. By having an underscore in the end, we are able to do: grep "chr1_", without pulling all other data.
Now I will make a while loop, which will work on the genome file (scalesia.deinterleaved.fna), and separate chromosomes into to separate files
while read chr; do \
echo "Working on $chr" ; \
grep --no-group-separator -A 1 "$chr" scalesia.deinterleaved.fna > ../01_separating_genome/${chr}.fa; \
done < ../01_separating_genome/chromosome.ids.tsv
Take a look at the files we created.
ls 01_separating_genome/
It should look like this: chr_10_.fa chr_11_.fa chr_12_.fa ... chr_34_.fa
Remember that for your own genomes you may want to try different k-mer sizes. I used 13 since there was a lot of divergence in the Scalesia atractyloides subgenomes, and k-mer 25 gave me no results.
For this step we need Jellyfish. I used version 2.2.10, but novel versions may be faster. Here, you may want to play with the parameters. I am using 20 cpus (-t 20), a k-mer size of 13 (-m 13) Also, notice I used unix to create a new name for the files ("newname"). In essence chr_24_.fa creates chr_24_.jf The output of jellyfish will be k-mer counts.
for i in 01_separating_genome/*fa; do \
echo $i; newname=$(echo ${i#*/} | sed "s/\.fa/.jf/"); \
jellyfish count -m 13 -s 100M -o 02_jellyfish_counts/$newname -t 20 $i; \
done
What we are doing now is to retrieve human-readable (also R-readable) data from jellyfish using its "dump" function. Here, again, you could tweet some of the parameters. I am using a column format (-c), and only k-mers present >100 times. Since the point of subgenome separation is to retrieve transposable elements, we are looking for regions of the genome that have high counts.
for i in 02_jellyfish_counts/*jf; do \
echo $i; \
jellyfish dump -c -L 100 $i > 03_jellyfish_dumps/${i#*/}.dumps.larger100.col; \
done
Before we move onto R, we will clean the files, by adding headers and replacing spaces " " with tabs "\t".
touch header; echo "information" >> header
for i in *col; do echo $i; cat header $i > tmp; mv tmp $i; sed -i "s/ /\t/; s/information/kmer\tfrequency/" $i; done
This is how the files should look like:
kmer frequency
AAAAAAAAAAAAA 2253
AAAAAAAAAAAAC 504
AAAAAAAAAAAAG 237
AAAAAAAAAAAAT 214
AAAAAAAAAAACA 338
Setting up your environment. We only need to install tidyverse, which is an excellent package.
getwd()
setwd("/Users/josecer/OneDrive - Universitetet i Oslo/tmp/tutorial") # Navigate to the folder where you have the col data.
library(tidyverse) # Load up your library
I will explain one pair. The code below will create an object a variable called s12 and s25 based on the files (ScDrF4C_12.jf.dumps.larger100.col)). We specify it is separated by tabs , and header is true. The weird symbol (%>%) is a pipe (|) provided by tidyverse, and then we rename the column
c12<-read.table("./chr_12_.jf.dumps.larger100.col",sep="\t", header=T) %>%
rename(c12=frequency)
c19<-read.table("./chr_19_.jf.dumps.larger100.col",sep="\t",header=T) %>%
rename(c19=frequency)
We merge both datasets on an object called pair01
pair01<-full_join(c12, c19, by="kmer")
We now filter for k-mers that are either twice-represented on a member of the pair. E.g.: where c12 = 101, and c19 = 300 (keep) where c12 = 1000, and c19 = 1900 (remove) where c12 = 500, and c25 = 201 (keep) where c12 = 600, and c25 = 1004 (remove)
filtered_pair01<- pair01 %>% filter(c12 > 2*c19 | c19 > 2*c12)
nrow(filtered_pair01)
I will repeat the code for two other pairs. I will not do the whole genome here for the interest of the tutorial not becoming too big :-)
11-5
c20<-read.table("./chr_20_.jf.dumps.larger100.col",sep="\t", header=T) %>%
rename(c20=frequency)
c1<-read.table("./chr_1_.jf.dumps.larger100.col",sep="\t",header=T) %>%
rename(c1=frequency)
pair02<-full_join(c20, c1, by="kmer")
filtered_pair02<- pair02 %>% filter(c20 > 2*c1 | c1 > 2*c20)
nrow(filtered_pair02)
c11<-read.table("./chr_11_.jf.dumps.larger100.col",sep="\t", header=T) %>%
rename(c11=frequency)
c5<-read.table("./chr_5_.jf.dumps.larger100.col",sep="\t",header=T) %>%
rename(c5=frequency)
pair03<-full_join(c11, c5, by="kmer")
filtered_pair03<- pair03 %>% filter(c11 > 2*c5 | c5 > 2*c11)
nrow(filtered_pair03)
We make a merged dataframe.
df<-full_join(filtered_pair01, filtered_pair02, by="kmer") %>%
full_join(filtered_pair03, by="kmer")
Prepare ourselves for plotting par(mfrow=c(1,1))
It is my experience that it is best to remove missing data. You may need to play here with your data (i.e. skip or keep this).
cleaned_df<-df[complete.cases(df), ]
We add names to the rownames, and clean the last column
rownames(cleaned_df)<-cleaned_df$kmer
squeaky_cleaned_df<-cleaned_df[,-1]
We transverse it so we can do a hierarchical clustering)
transversed_squeaky_cleaned_df<-t(squeaky_cleaned_df)
We calculate distances between variables, make a hierarchical clustering, and plot it.
dist_transversed_squeaky_cleaned_df <- dist(transversed_squeaky_cleaned_df)
hclust_avg <- hclust(dist_transversed_squeaky_cleaned_df)
plot(hclust_avg, main="13bp kmers at least 100x per chromossome, no missing data; scalesia")
You see the two separate groupings: (c20, c12, c11), (c19, c1, c5)? They're your subgenomes!
Introduction
k-mer spectra analysis
- 📖 Introduction to K-mer spectra analysis
- 📖 Basics of genome modeling
- ⚒ manual model fitting (for better understanding of the underlying model)
- ⚒ simple diploid
- ⚒ demonstrating the effect of sequencing error rate on k-mer coverage
- 📖 Common difficulties in characterisation of diploid genomes using k mer spectra analysis
- ⚒ low coverage (pitfall) - to be merged
- ⚒ very homozygous diploid
- ⚒ highly heterozygous diploid
- ⚒ Genome size of a repetitive genome (pitfall)
- ⚒ Wrong ploidy (pitfall)
- 📖 Characterization of polyploid genomes using k mer spectra analysis
- ⚒ Autotetraploid
- ⚒ Allotetraploid
- ⚒ Estimating ploidy (smudgeplot)
- 📖 Genome modeling as a quality control
- ⚒ Contamination (pitfall)
- ⚒ k-mers in an assembly (Mercury/KAT)
- 📖 Analysing genome skimming data
Separation of chromosomes
- 📖Separate sub-genomes of an allopolyploid
- 📖Separating chromosomes by comparison of sequencing libraries
Species assignment using short k-mers
Others