rMSA seamlessly interfaces multiple sequence alignment software packages, including
- ClustalW
- MAFFT
- MUSCLE
- Kalign
- Boxshade
with the Bioconductor infrastructure. The tools have to be installed separately.
In addition, distances between sets of sequences can be calculated, including:
- k-mer-based distances
  - Jensen-Shannon divergence (JSD) between feature frequency profiles (FFPs) (Sims and Kim, 2011)
  - Cosine distance between Composition Vectors (Qi et al, 2007)
  - Numerical summarization vector distance (Nagar and Hahsler, 2013)
  - Jaccard distance between sets of k-mers
  - Distance based on SimRank (Santis et al, 2011)
- Edit distance/alignment-based distances
  - Edit (Levenshtein) distance (package pwalign)
  - Distance based on alignment scores (package pwalign)
- Evolutionary distances
  - Evolutionary distances (package ape)
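To make the k-mer-based idea concrete, here is a minimal base-R sketch of a Jaccard distance between sets of k-mers. It does not use rMSA. The helper names `kmer_set()` and `jaccard_dist()`, the choice `k = 3`, and the toy sequences are assumptions made for illustration, and the package's own implementation may differ in its details.

```r
# Illustrative sketch only: hypothetical helpers, not part of rMSA.

# Return the set of distinct overlapping substrings of length k.
kmer_set <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))
  starts <- seq_len(n - k + 1)
  unique(substring(s, starts, starts + k - 1))
}

# Jaccard distance: 1 - |intersection| / |union| of the two k-mer sets.
jaccard_dist <- function(a, b, k = 3) {
  A <- kmer_set(a, k)
  B <- kmer_set(b, k)
  u <- length(union(A, B))
  if (u == 0) return(0)  # both sequences shorter than k
  1 - length(intersect(A, B)) / u
}

# Toy sequences (made up for this example).
seqs <- c(s1 = "ACGUACGUAC", s2 = "ACGUACGGAC", s3 = "UUUUGGGGCC")

# Fill a symmetric matrix of pairwise distances and convert it to a dist object.
n <- length(seqs)
m <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    m[i, j] <- jaccard_dist(seqs[[i]], seqs[[j]])
  }
}
d <- as.dist(m)
print(d)
```

The resulting `dist` object can be passed to `hclust()` in the same way as the distances in the clustering example further down.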
A short guide with examples can be found here.
Current development version: Install from r-universe.
install.packages("rMSA",
    repos = c("https://mhahsler.r-universe.dev", "https://cloud.r-project.org/"))
Additional installation instructions can be found here.
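Because the alignment tools are installed separately, it can help to see which of them are visible from R. The sketch below uses base R's `Sys.which()` to look for executables on the search path. The executable names are assumptions (they vary by platform and version), and whether rMSA itself locates the tools through the search path is not confirmed here.

```r
# Illustrative sketch only: the executable names below are assumptions.
tools <- c("clustalw", "mafft", "muscle", "kalign", "boxshade")

# Sys.which() returns the full path of each executable, or "" if not found.
found <- Sys.which(tools)

status <- data.frame(
  tool = tools,
  path = unname(found),
  installed = unname(nzchar(found)),
  stringsAsFactors = FALSE
)
print(status)
```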
Align sequences using ClustalW.
library("rMSA")
rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package = "rMSA"))
rna
## RNAStringSet object of length 5:
## width seq names
## [1] 1481 AGAGUUUGAUCCUGGCUCAGAAC...GGUGAAGUCGUAACAAGGUAACC 1675 AB015560.1 d...
## [2] 1404 GCUGGCGGCAGGCCUAACACAUG...CACGGUAAGGUCAGCGACUGGGG 4399 D14432.1 Rho...
## [3] 1426 GGAAUGCUNAACACAUGCAAGUC...AACAAGGUAGCCGUAGGGGAACC 4403 X72908.1 Ros...
## [4] 1362 GCUGGCGGAAUGCUUAACACAUG...UACCUUAGGUGUCUAGGCUAACC 4404 AF173825.1 A...
## [5] 1458 AGAGUUUGAUUAUGGCUCAGAGC...UGAAGUCGUAACAAGGUAACCGU 4411 Y07647.2 Dre...
al <- clustal(rna)
al
## RNAMultipleAlignment with 5 rows and 1500 columns
## aln names
## [1] --------------------------...GNAACAAGGUAGCCGUAGGGGAACC 4403
## [2] --------------------------...------------------------- 4404
## [3] AGAGUUUGAUUAUGGCUCAGAGCGAA...GUAACAAGGUAACCGU--------- 4411
## [4] --------------------------...------------------------- 4399
## [5] AGAGUUUGAUCCUGGCUCAGAACGAA...GUAACAAGGUAACC----------- 1675
Cluster mutations of a sequence using SimRank.
s <- random_sequences(len = 100, number = 1)
ms <- mutations(s, number = 20)
dSimRank <- distSimRank(ms)
plot(as.dendrogram(hclust(dSimRank)), horiz=TRUE, type="triangle")
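For readers without the package installed, the same workflow can be approximated in base R. This is only a sketch under stated assumptions: the hypothetical `mutate()` helper is a simplified stand-in for `mutations()` that applies random substitutions only, a DNA alphabet is assumed, and the Levenshtein distance from `utils::adist()` is swapped in for SimRank.

```r
# Illustrative sketch only: base-R stand-ins, not the rMSA functions.
set.seed(42)
alphabet <- c("A", "C", "G", "T")  # assumed DNA alphabet

# One random sequence of length 100, kept as a vector of single bases.
s <- sample(alphabet, 100, replace = TRUE)

# Overwrite n_sub random positions with random bases
# (a drawn base may equal the original, leaving that position unchanged).
mutate <- function(x, n_sub = 5) {
  pos <- sample(length(x), n_sub)
  x[pos] <- sample(alphabet, n_sub, replace = TRUE)
  paste(x, collapse = "")
}

# Create 20 mutated copies of the original sequence.
ms <- vapply(1:20, function(i) mutate(s), character(1))
names(ms) <- paste0("mut", 1:20)

# Pairwise Levenshtein distances, then hierarchical clustering.
d <- as.dist(adist(ms))
hc <- hclust(d)
plot(as.dendrogram(hc), horiz = TRUE, type = "triangle")
```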
To cite package ‘rMSA’ in publications use:
Hahsler M, Nagar A (2024). rMSA: Interface for Popular Multiple Sequence Alignment Tools. R package version 0.99.1.
@Manual{,
title = {rMSA: Interface for Popular Multiple Sequence Alignment Tools},
author = {Michael Hahsler and Anurag Nagar},
year = {2024},
note = {R package version 0.99.1},
}
Gao L, Qi J (2007). “Whole genome molecular phylogeny of large dsDNA viruses using composition vector method.” BMC Evolutionary Biology, 7: 41. PMID 17359548.
Hahsler M, Nagar A (2024). rMSA: Interface for Popular Multiple Sequence Alignment Tools. R package version 0.99.1.
Nagar A, Hahsler M (2013). “Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment.” BMC Bioinformatics, 14(Suppl. 11).
Santis et al (2011). “Simrank: Rapid and sensitive general-purpose k-mer search tool.” BMC Ecology, 11: 11.
Sims GE, Kim SH (2011). “Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs).” Proceedings of the National Academy of Sciences of the United States of America, 108(20): 8329-34. PMID 21536867.
Qi J, Wang B, Hao B (2004). “Whole Proteome Prokaryote Phylogeny without Sequence Alignment: A K-String Composition Approach.” Journal of Molecular Evolution, 58: 1-11.