Skip to content

Latest commit

 

History

History
143 lines (112 loc) · 4.69 KB

README.md

File metadata and controls

143 lines (112 loc) · 4.69 KB

LSUx

R build status Codecov test coverage

Cut rDNA sequences into domains using covariance models

Installation

LSUx uses Infernal (INFERence of RNA ALignment) to match the 5.8S and LSU regions. This means that Infernal must be installed for it to function. Follow the instructions on the Infernal webpage to accomplish this.

Note that Infernal does not work on Windows.

The R interface to Infernal is in package inferrnal (two “r”s).

You can install the development versions of inferrnal and LSUx from GitHub with:

# install.packages("devtools")
devtools::install_github("brendanf/inferrnal")
devtools::install_github("brendanf/LSUx")
library(LSUx)

Usage

LSUx extracts alternating variable and conserved domains from the contiguous rDNA regions which form the eukaryotic ribosomal large subunit, i.e. 5.8S RNA, ITS2, and 28S RNA. For the purposes of this document, this region will be referred to as the 32S precursor RNA, as in humans, although its actual size in Svedberg units varies between lineages.

Input sequences should contain, at a minimum, a significant fraction of the 5.8S RNA, which is used to define the 5’ end of 32S. Any base pairs before the 5’ end of 5.8S will be considered to be ITS1 (ITS1 = TRUE) or discarded (ITS1 = FALSE). Input sequences should not extend past the end of the 32S model at the 3’ end.

LSUx requires two covariance models: one for 5.8S, which is used in to locate the 5’ end of the target region, and one for 32S. The 5.8S model can be RF00002 from Rfam (the default), or an equivalent. If using a custom CM, it must be calibrated using cmcalibrate from Infernal.

The 32S model must include annotations in the reference line (“#=GC RF” in the seed alignment) to distinguish conserved and variable regions. The annotations should be sequential characters in the range 1..9A..Z for conserved domains, “v” for variable domains, and “.” for unaligned gaps in the seed alignment. In the output, the conserved domains will be named “5_8S”, “LSU1”, “LSU2”, …; the variable domains will be named “ITS2”, “V1”, “V2”, …

Two example models are included, both based on the RDP fungal LSU CM, and annotated with variable regions according to Raué (1988). The first, includes the full LSU region.

cm_full <- system.file(
    file.path("extdata", "fungi_32S.cm"),
    package = "LSUx"
)

The second, is truncated at the binding site of the LR5 primer, and should be faster for input sequences which do not extend past that point.

cm_trunc <- system.file(
    file.path("extdata", "fungi_32S_LR5.cm"),
    package = "LSUx"
)

The seed alignments are also provided; they can be accessed using:

system.file(file.path("extdata", "fungi_32S.stk"), package = "LSUx")
system.file(file.path("extdata", "fungi_32S_LR5.stk"), package = "LSUx")

The sample data is from an environmental metabarcoding study focusing on fungi.

seq <- system.file("extdata/sample.fasta", package = "inferrnal")

The sequences in the sample data were amplified using primers ITS1 and LR5, so using the truncated CM is appropriate.

cm_32S_trunc <- system.file(
    file.path("extdata", "fungi_32S_LR5.cm"),
    package = "LSUx"
)

LSUx gives a tibble with the start and end locations of each region for each sequence.

regions <- lsux(seq, cm_32S = cm_32S_trunc, ITS1 = TRUE, cpu = 1)
## INFO [2024-04-04 18:12:21] Beginning CM search.
## INFO [2024-04-04 18:12:21] 48/50 sequences contained a single 5.8S hit.
## INFO [2024-04-04 18:12:21] Beginning CM alignment with mxsize=NULL and cpu=1.
## INFO [2024-04-04 18:12:34] Extracting LSU regions.
regions
## # A tibble: 480 × 5
##    seq_id length region start   end
##    <chr>   <int> <chr>  <int> <int>
##  1 seq1     1404 ITS1       1   114
##  2 seq1     1404 5_8S     115   271
##  3 seq1     1404 ITS2     272   412
##  4 seq1     1404 LSU1     413   517
##  5 seq1     1404 V2       518   673
##  6 seq1     1404 LSU2     674   832
##  7 seq1     1404 V3       833  1106
##  8 seq1     1404 LSU3    1107  1147
##  9 seq1     1404 V4      1148  1242
## 10 seq1     1404 LSU4    1243  1404
## # ℹ 470 more rows