polymorphology2 is an R package offering a general toolkit for handling genomic data efficiently. Its functions focus on analyzing polymorphisms in the context of genome features such as gene bodies, epigenome enrichments, and single-base substitution (SBS) mutation profiles.
The package provides functions for identifying overlaps between genome features and between sites and features, for example to calculate the distribution of sites across all genome features. It also includes functions routinely used by our lab, such as filtering somatic mutations called by Strelka2, constructing windows around genome features like genes, and evaluating the enrichment of ChIPseq experiments in specific genome features.
The package includes the following core functions, among others:
- bedGraph_total(): Calculates the total depth of a bedGraph file.
- feature_windows(): Constructs windows around features.
- features_chip_enrich(): Calculates the enrichment of ChIPseq experiments in genome features.
- features_in_features(): Finds overlaps between different feature sets.
- features_in_sites(): Finds the features for given sites.
- motif_hunter(): Searches for a specific motif in the provided sequence.
- plot_feature_windows(): Plots feature windows.
- plot_tricontexts(): Plots trinucleotide contexts.
- read.GFF(): Reads GFF files.
- read.VCF(): Reads VCF files.
- read.bedGraph(): Reads bedGraph files.
- sites_in_features(): Finds the sites that are located within features.
- strelka2_filter(): Filters somatic mutations called by Strelka2.
- tricontexts(): Finds the trinucleotide context for given mutations.
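To make "trinucleotide context" concrete, here is a minimal base R sketch of the idea: the context of a site is the reference base plus one neighbour on each side. The helper `tri_context()` and the sequence are hypothetical and for illustration only. This is not the package's `tricontexts()` implementation, whose arguments and output may differ.

```r
# Hypothetical helper (NOT the package's tricontexts()): return the base at a
# 1-based position together with its immediate 5' and 3' neighbours.
tri_context <- function(sequence, pos) {
  # A context only exists for positions that have a neighbour on both sides.
  stopifnot(all(pos > 1), all(pos < nchar(sequence)))
  substr(rep(sequence, length(pos)), pos - 1, pos + 1)
}

ref_seq <- "ACGTTGCA"          # made-up reference sequence
tri_context(ref_seq, 4)        # "GTT": bases 3 to 5 around the T at position 4
tri_context(ref_seq, c(2, 7))  # several sites at once
```

Mutation profiles usually pair each context with the substitution observed at the central base, so a real workflow would carry the reference and alternate alleles along with the position.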
The package can be installed directly from GitHub using devtools with:
devtools::install_github("greymonroe/polymorphology2")
The package works primarily with two kinds of objects:
- Features: data.table objects comprising CHROM, START, STOP, and ID columns.
- Sites: data.table objects comprising CHROM, POS, and ID columns.
Both sites and features can incorporate other columns, which can be leveraged for various computations (e.g., calculating the average depth of ChIP results).
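The two object types can be sketched with plain data.table calls. All coordinates, IDs, and the extra `DEPTH` column below are made up for illustration. The overlap step uses `data.table::foverlaps()` as a stand-in to show the idea of locating sites within features. It is an assumption about the general approach, not the package's own `sites_in_features()` code.

```r
library(data.table)

# Features: intervals with CHROM, START, STOP, and ID columns (made-up values).
features <- data.table(
  CHROM = c("Chr1", "Chr1", "Chr2"),
  START = c(100L, 500L, 50L),
  STOP  = c(200L, 800L, 150L),
  ID    = c("geneA", "geneB", "geneC")
)

# Sites: single positions with CHROM, POS, and ID columns, plus an extra
# DEPTH column of the kind that can be used in later computations.
sites <- data.table(
  CHROM = c("Chr1", "Chr1", "Chr2", "Chr2"),
  POS   = c(150L, 300L, 60L, 400L),
  ID    = c("s1", "s2", "s3", "s4"),
  DEPTH = c(10, 20, 30, 40)
)

# foverlaps() needs interval columns on both tables, so treat each site as
# the one-position interval [POS, POS]. Rename ID to avoid a name clash.
sites_iv <- copy(sites)
sites_iv[, START := POS]
sites_iv[, STOP := POS]
setnames(sites_iv, "ID", "SITE_ID")

feats_iv <- copy(features)
setnames(feats_iv, "ID", "FEATURE_ID")
setkey(feats_iv, CHROM, START, STOP)  # foverlaps() requires a keyed y table

# Keep only sites that fall inside a feature (nomatch = 0L drops the rest).
hits <- foverlaps(sites_iv, feats_iv,
                  by.x = c("CHROM", "START", "STOP"),
                  type = "within", nomatch = 0L)
hits[, .(CHROM, POS, SITE_ID, FEATURE_ID, DEPTH)]

# Use the extra column: average depth of the sites within each feature.
depth_by_feature <- hits[, .(mean_depth = mean(DEPTH)), by = FEATURE_ID]
depth_by_feature
```

Here sites `s1` and `s3` fall inside `geneA` and `geneC`, while `s2` and `s4` lie outside every feature and are dropped.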
Future updates will include:
- Parsing of VCF files with particular formats (extracting GT, DP, etc. into columns), e.g., variants called with DeepVariant, Strelka2, pbsv, or HaplotypeCaller, plus additional functions for VCFs with multiple samples.
- N-mer amino acid frequencies across protein sequences.
- Addition of user tutorials for a more comprehensive understanding of package usage and functionalities.
The package relies on these R packages (automatically installed):
- data.table
- ggplot2
- seqinr
- vcfR
- stringr
This package is open-source and free to use, modify, and repurpose as needed.
For any issues or inquiries, feel free to reach out to the maintainer:
- Grey Monroe - greymonroe@gmail.com