Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create SNP workflow #53

Merged
merged 27 commits into from
Jul 21, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
8278170
initial commit [skip ci]
rnmitchell May 11, 2023
48bacaa
began work on snp pipeline, saving work [skip ci]
rnmitchell May 12, 2023
3481711
added kintelligence snps to convert wrapper script
rnmitchell May 16, 2023
241042e
fix up convert script [skip ci]
rnmitchell May 17, 2023
863e6ae
added filtering of non-typed alleles [skip ci]
rnmitchell May 17, 2023
b2c3c36
fixing up strait razor processing [skip ci]
rnmitchell May 18, 2023
47f5738
fixing strait razor processing [skip ci]
rnmitchell May 18, 2023
b1379ee
updated snp tests [skip ci]
rnmitchell May 19, 2023
539001c
improved handling of filtering by snp type [skip ci]
rnmitchell May 19, 2023
e402cef
removed duplicate rows when both a and p snps included
rnmitchell May 19, 2023
99f186c
fixed strait razor processing [skip ci]
rnmitchell Jun 6, 2023
ff0508c
changed forenseq kit name to sigprep [skip ci]
rnmitchell Jun 25, 2023
c43ca77
began writing snps format script [skip ci]
rnmitchell Jun 26, 2023
e7b156a
finished code for formatting evidence sigprep samples [skip ci]
rnmitchell Jun 26, 2023
77eccc4
added strait razor processing [skip ci]
rnmitchell Jun 26, 2023
040b43b
updated reference samples [skip ci]
rnmitchell Jun 26, 2023
9dfe0a9
switched convert and format commands to be consistent with strs [skip…
rnmitchell Jul 5, 2023
68c2543
filtering of untyped alleles and below AT
rnmitchell Jul 11, 2023
c99f319
added tests; fixed bug with having multiple references
rnmitchell Jul 18, 2023
9a25dcf
added strand orientation option to STRs
rnmitchell Jul 19, 2023
8ece1d7
fixed bug with specifying strand orientation; added test for using fo…
rnmitchell Jul 19, 2023
d80aff3
updated README
rnmitchell Jul 19, 2023
0b35f52
updated manifest and setup.py
rnmitchell Jul 19, 2023
2c4fffc
fixed bug with having multiple reference ids listed
rnmitchell Jul 20, 2023
7160c2d
added file for test
rnmitchell Jul 20, 2023
96fb563
added another file for tests
rnmitchell Jul 20, 2023
4bdcb53
fixed headers
rnmitchell Jul 21, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions MANIFEST.in
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,4 @@ include lusSTR/tests/data/UAS_bulk_input/*
include lusSTR/tests/data/snps/*
include lusSTR/tests/data/RU_stutter_test/*
include lusSTR/tests/data/NGS_stutter_test/*
include lusSTR/tests/data/kinsnps/*
59 changes: 55 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,16 +27,18 @@ make devenv
## Usage

lusSTR accomodates three different input formats:
(1) UAS Sample Details Report and UAS Phenotype Report (for SNP processing) in .xlsx format (a single file or directory containing multiple files)
(1) UAS Sample Details Report, UAS Sample Report, and UAS Phenotype Report (for SNP processing) in .xlsx format (a single file or directory containing multiple files)
(2) STRait Razor output with one sample per file (a single file or directory containing multiple files)
(3) Sample(s) sequences in CSV format; first four columns must be Locus, NumReads, Sequence, SampleID; Optional last two columns can be Project and Analysis IDs.

*These individual sample files or directory of files must be specified in the config file (see below).*


lusSTR utilizes the ```lusstr``` command to invoke various Snakemake workflows. The ```lusstr strs``` command invokes the STR analysis workflow. *The SNP workflow is currently under construction.*
lusSTR utilizes the ```lusstr``` command to invoke various Snakemake workflows. The ```lusstr strs``` command invokes the STR analysis workflow.

The ```lusstr snps``` command invokes the SNP analysis workflow. Please see below for further information on processing SNP data.
___
### Creating the config file
### Creating the SNP config file

Running ```lusstr config``` creates a config file containing the default settings for the lusSTR STR analysis pipeline. The settings can be changed with command line arguments (see below) or by manually editing the config file. The default settings, along with their descriptions, are as follows:

Expand All @@ -57,6 +59,7 @@ data_type: ```ngs``` (ce/ngs) (invoke ```--ce``` if using CE allele data)
info: ```True``` (True/False); create allele information file (invoke ```--noinfo``` flag to not create the allele information file)
separate: ```False``` (True/False); for EFM only, if True will create individual files for samples; if False, will create one file with all samples (invoke ```--separate``` flag to separate EFM output files)
nofilters: ```False``` (True/False); skip all filtering steps but still creates EFM/STRmix output files (invoke ```--nofilters``` flag)
strand: ```uas``` (uas/forward); indicates the strand orientation in which to report the sequence in the final output table for STRmix NGS only (indicate using ```--strand```)

One additional argument can be provided with ```lusstr config```:
```-w```/```-workdir``` sets the working directory (e.g. ```-w lusstr_files/```) and all created files are stored in that directory.
Expand Down Expand Up @@ -175,8 +178,56 @@ ___

## SNP Data Processing

Currently under construction
lusSTR is able to process SNPs derived from the ForenSeq Signature Prep assay and the ForenSeq Kintelligence assay. SNPs from the ForenSeq Signature Prep assay could be analyzed using either the Verogen UAS or STRait Razor. SNPs from the ForenSeq Kintelligence assay must first be analyzed using the UAS.

___
### Creating the STR config file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy/paste error: should be "Creating the SNP config file"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol good catch


Running ```lusstr config --snps``` creates a config file containing the default settings for the lusSTR SNP analysis pipeline. The settings can be changed with command line arguments (see below) or by manually editing the config file. The default settings, along with their descriptions, are as follows:


### general settings
uas: ```True``` (True/False); if ran through UAS (invoke ```--straitrazor``` flag if STRait Razor was used)
samp_input: ```/path/to/input/directory/or/samples``` input directory or sample; if not provided, will be current working directory (indicate using ```--input path/to/dir```)
output: ```lusstr_output``` output file/directory name; (indicate using ```--out dir/sampleid e.g. --out test_030923```)
kit: ```sigprep``` (sigprep/kintelligence) (invoke using the ```--kintelligence``` flag if using Kintelligence data)

### format settings
types: ```all``` choices are "all", "i" (identity SNPs only), "p" (phenotype only), "a" (ancestry only) or any combination (indicate using the ```--snp-type e.g. --snp-type i, p```)
nofilter: ```False``` (True/False); if no filtering is desired at the format step; if False, will remove any allele designated as Not Typed (invoke using the ```--nofiltering```)

### convert settings
strand: ```forward``` (forward/uas); indicates which orientation to report the alleles for the SigPrep SNPs; uas indicates the orientation as reported by the UAS or the forward strand
references: ## list IDs of the samples to be run as references in EFM; default is no reference samples
separate: false ## True/False; if want to separate samples into individual files for use in EFM
thresh: 0.03 ## Analytical threshold value


One additional argument can be provided with ```lusstr config```:
```-w```/```-workdir``` sets the working directory (e.g. ```-w lusstr_files/```) and all created files are stored in that directory.



### general settings:
uas: ```True``` (True/False); if ran through UAS (invoke ```--straitrazor``` flag if STRait Razor was used)
sex: ```False``` (True/False); include sex-chromosome STRs (invoke ```--sex``` flag)
samp_input: ```/path/to/input/directory/or/samples``` input directory or sample; if not provided, will be current working directory (indicate using ```--input path/to/dir``` )
output: ```lusstr_output``` output file/directory name (indicate using ```--out dir/sampleid e.g. --out test_030923```)

### convert settings
kit: ```forenseq``` (forenseq/powerseq) (invoke the ```--powerseq``` flag if using PowerSeq data)
nocombine: ```False``` (True/False); do not combine identical sequences during the ```convert``` step, if using STRait Razor data. (invoke the ```--nocombine``` flag)

### filter settings
output_type: ```strmix``` (strmix/efm) (invoke ```--efm``` flag if creating output for EuroForMix)
profile_type: ```evidence``` (evidence/reference) (invoke ```--reference``` flag if creating a reference output file)
data_type: ```ngs``` (ce/ngs) (invoke ```--ce``` if using CE allele data)
info: ```True``` (True/False); create allele information file (invoke ```--noinfo``` flag to not create the allele information file)
separate: ```False``` (True/False); for EFM only, if True will create individual files for samples; if False, will create one file with all samples (invoke ```--separate``` flag to separate EFM output files)
nofilters: ```False``` (True/False); skip all filtering steps but still creates EFM/STRmix output files (invoke ```--nofilters``` flag)
strand: ```forward``` (uas/forward); indicates the strand orientation in which to report the alleles in the final output table (indicate using ```--strand```)

----


lusSTR is still under development and any suggestions/issues found are welcome!
71 changes: 65 additions & 6 deletions lusSTR/cli/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,13 +20,45 @@

def main(args):
Path(args.workdir).mkdir(parents=True, exist_ok=True)
final_dest = f"{args.workdir}/config.yaml"
config = resource_filename("lusSTR", "data/config.yaml")
final_config = edit_config(config, args)
if args.snps:
final_dest = f"{args.workdir}/snp_config.yaml"
config = resource_filename("lusSTR", "data/snp_config.yaml")
final_config = edit_snp_config(config, args)
else:
final_dest = f"{args.workdir}/config.yaml"
config = resource_filename("lusSTR", "data/config.yaml")
final_config = edit_str_config(config, args)
with open(final_dest, "w") as file:
yaml.dump(final_config, file)

def edit_config(config, args):

def edit_snp_config(config, args):
with open(config, "r") as file:
data = yaml.safe_load(file)
if args.straitrazor:
data["uas"] = False
if args.input:
data["samp_input"] = args.input
if args.out:
data["output"] = args.out
if args.snptype:
data["types"] = args.snptype
if args.kintelligence:
data["kit"] = "kintelligence"
if args.separate:
data["separate"] = True
if args.nofiltering:
data["nofilter"] = True
if args.ref:
data["references"] = args.ref
else:
data["references"] = None
if args.strand:
data["strand"] = args.strand
return data


def edit_str_config(config, args):
with open(config, "r") as file:
data = yaml.safe_load(file)
if args.straitrazor:
Expand Down Expand Up @@ -55,6 +87,8 @@ def edit_config(config, args):
data["data_type"] = "ce"
if args.efm:
data["output_type"] = "efm"
if args.strand:
data["strand"] = args.strand
return data


Expand Down Expand Up @@ -86,7 +120,7 @@ def subparser(subparsers):
)
p.add_argument(
"--reference", action="store_true",
help="Use for creating Reference profiles"
help="Use for creating Reference profiles for STR workflow"
)
p.add_argument("--efm", action="store_true",help="Use to create EuroForMix profiles")
p.add_argument("--ce", action="store_true", help="Use for CE data")
Expand All @@ -100,5 +134,30 @@ def subparser(subparsers):
)
p.add_argument(
"--nofiltering", action="store_true",
help="Use to perform no filtering during the 'filter' step"
help="For STRs, use to perform no filtering during the 'filter' step. For SNPs, "
"only alleles specified as 'Typed' by the UAS will be included at the 'format' step."
)
p.add_argument(
"--snps", action="store_true",
help="Use to create a config file for the SNP workflow"
)
p.add_argument(
"--snp-type", default="all", dest="snptype",
help="Specify the type of SNPs to include in the final report. 'p' will include only the "
"Phenotype SNPs; 'a' will include only the Ancestry SNPs; 'i' will include only the "
"Identity SNPs; and 'all' will include all SNPs. More than one type can be specified (e.g. "
" 'p, a'). Default is all."
)
p.add_argument(
"--kintelligence", action="store_true",
help="Use if processing Kintelligence SNPs within a Kintellience Report(s)"
)
p.add_argument(
"--snp-reference", dest="ref",
help="Specify any references for SNP data for use in EFM."
)
p.add_argument(
"--strand", choices=["uas", "forward"],
help="Specify the strand orientation for the final output files. UAS orientation is "
"default for STRs; forward strand is default for SNPs."
)
27 changes: 19 additions & 8 deletions lusSTR/cli/snps.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# -------------------------------------------------------------------------------------------------
# Copyright (c) 2020, DHS.
# Copyright (c) 2023, DHS.
#
# This file is part of lusSTR (http://github.com/bioforensics/lusSTR) and is licensed under
# the BSD license: see LICENSE.txt.
Expand All @@ -10,16 +10,27 @@
# Development Center.
# -------------------------------------------------------------------------------------------------

import argparse
import lusSTR
import snakemake
from snakemake import snakemake

## placeholder until I update this

def main(args):
raise NotImplementedError('SNP workflow implementation pending')
pretarget = args.target if args.target != "all" else "convert"
workdir = args.workdir
result = snakemake(
lusSTR.snakefile(workflow="snps"), targets=[pretarget], workdir=workdir, verbose=True
)
if result is not True:
raise SystemError('Snakemake failed')

def subparser(subparsers):
p = subparsers.add_parser("snps", description="Running the entire STR pipeline (format, annotate and filter)")
p.add_argument("--config", default="config.yaml", help="config file used to identify settings.")
p.add_argument("-w", "--workdir", metavar="W", default=".", help="working directory")
p.add_argument("--skip-filter", dest="filter", action = "store_true", help="Skip filtering step")
p = subparsers.add_parser(
"snps", description="Running the SNP pipeline"
)
p.add_argument(
"target", choices=["format", "all"],
help="Steps to run. Specifying 'format' will run only 'format'. Specifying "
"'all' will run all steps of the SNP workflow ('format' and 'convert')."
)
p.add_argument("-w", "--workdir", metavar="W", default=".", help="working directory")
3 changes: 2 additions & 1 deletion lusSTR/data/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,5 @@ profile_type: "evidence" ## evidence/reference
data_type: "ngs" ## ce/ngs
info: True ## True/False; create allele information file
separate: False ##True/False; for EFM only, if True will create individual files for samples; if False, will create one file with all samples
nofilters: False ##True/False; skip all filtering steps but still creates EFM/STRmix output files
nofilters: False ##True/False; skip all filtering steps but still creates EFM/STRmix output files
strand: uas ##uas/forward; strand orientation to report
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realized I had never added the argument to specify strand orientation.

18 changes: 18 additions & 0 deletions lusSTR/data/snp_config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
%YAML 1.2
---

## general settings
uas: True ## True/False; if ran through UAS
samp_input: "/path/to/input/directory/or/samples" ## input directory or sample; if not provided, will be cwd
output: "lusstr_output" ## output file/directory name; Example: "test_030923"
kit: "sigprep" ## sigprep/kintelligence

## format settings
types: "all" ## choices are "all", "i" (identity SNPs only), "p" (phenotype only), "a" (ancestry only) or any combination
nofilter: False ## True/False if no filtering is desired; if False, will remove any allele designated as Not Typed

## convert settings
strand: "forward" ## forward/uas; indicates which oritentation to report the alleles for the ForenSeq SNPs; uas indicates the orientation as reported by the UAS or the forward strand
references: "" ## list IDs of the samples to be run as references in EFM
separate: false ## True/False; if want to separate samples into individual files for use in EFM
thresh: 0.03 ## Analytical threshold value
35 changes: 24 additions & 11 deletions lusSTR/scripts/filter_settings.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,17 @@ def get_filter_metadata_file():
filter_marker_data = json.load(fh)


def filters(locus_allele_info, locus, locus_reads, datatype):
def filters(locus_allele_info, locus, locus_reads, datatype, brack_col):
metadata = filter_marker_data[locus]
if len(locus_allele_info) == 1:
locus_allele_info = single_allele_thresholds(metadata, locus_reads, locus_allele_info)
else:
locus_allele_info, locus_reads = multiple_allele_thresholds(
metadata, locus_reads, locus_allele_info
)
locus_allele_info = ce_filtering(locus_allele_info, locus_reads, metadata, datatype)
locus_allele_info = ce_filtering(
locus_allele_info, locus_reads, metadata, datatype, brack_col
)
if datatype == "ngs":
locus_allele_info = same_size_filter(locus_allele_info, metadata)
return locus_allele_info
Expand Down Expand Up @@ -98,7 +100,7 @@ def thresholds(filter, metadata, locus_reads, quest_al_reads):
return locus_reads, True


def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
def ce_filtering(locus_allele_info, locus_reads, metadata, datatype, brack_col):
for i in range(len(locus_allele_info)): # check for stutter alleles
if locus_allele_info.loc[i, "allele_type"] != "real_allele":
continue
Expand All @@ -111,7 +113,14 @@ def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
if init_type_all == "BelowAT":
continue
locus_allele_info = allele_ident(
locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype
locus_allele_info,
init_type_all,
metadata,
ref_allele_reads,
i,
j,
datatype,
brack_col,
)
for j in range(len(locus_allele_info)):
type_all = locus_allele_info.loc[j, "allele_type"]
Expand All @@ -133,13 +142,15 @@ def ce_filtering(locus_allele_info, locus_reads, metadata, datatype):
return locus_allele_info


def allele_ident(locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype):
def allele_ident(
locus_allele_info, init_type_all, metadata, ref_allele_reads, i, j, datatype, brack_col
):
quest_al_reads = locus_allele_info.loc[j, "Reads"]
ref_allele = float(locus_allele_info.loc[i, "CE_Allele"])
question_allele = float(locus_allele_info.loc[j, "CE_Allele"])
if datatype == "ngs":
ref_bracket = locus_allele_info.loc[i, "UAS_Output_Bracketed_Notation"]
question_bracket = locus_allele_info.loc[j, "UAS_Output_Bracketed_Notation"]
ref_bracket = locus_allele_info.loc[i, brack_col]
question_bracket = locus_allele_info.loc[j, brack_col]
else:
ref_bracket = None
question_bracket = None
Expand All @@ -155,6 +166,7 @@ def allele_ident(locus_allele_info, init_type_all, metadata, ref_allele_reads, i
ref_bracket,
question_bracket,
datatype,
brack_col,
)
if "stutter" in locus_allele_info.loc[j, "allele_type"]:
if "/" in locus_allele_info.loc[j, "allele_type"] and pd.isnull(
Expand Down Expand Up @@ -254,6 +266,7 @@ def allele_type(
ref_bracket,
question_bracket,
datatype,
brack_col,
):
stutter_thresh = metadata["StutterThresholdDynamicPercent"]
forward_thresh = metadata["StutterForwardThresholdDynamicPercent"]
Expand All @@ -275,11 +288,11 @@ def allele_type(
)
elif allele_diff == 2 and ref_reads > quest_al_reads: # -2 stutter
allele = ce if datatype == "ce" else question_bracket
if check_2stutter(all_type_df, datatype, allele)[0] is True:
if check_2stutter(all_type_df, datatype, allele, brack_col)[0] is True:
if (
datatype == "ngs" and bracketed_stutter_id(ref_bracket, question_bracket, -2) == -2
) or datatype == "ce":
ref_reads = check_2stutter(all_type_df, datatype, allele)[1]
ref_reads = check_2stutter(all_type_df, datatype, allele, brack_col)[1]
stutter_thresh_reads = stutter_thresh * ref_reads
all_type, stut_perc = minus2_stutter(
all_type,
Expand Down Expand Up @@ -318,7 +331,7 @@ def forward_stut_thresh(perc, perc_stut, reads):
return forward_thresh


def check_2stutter(stutter_df, allele_des, allele):
def check_2stutter(stutter_df, allele_des, allele, brack_col):
is_true, reads = False, None
if "-1_stutter" in stutter_df.loc[:, "allele_type"].values:
if allele_des == "ce":
Expand All @@ -329,7 +342,7 @@ def check_2stutter(stutter_df, allele_des, allele):
break
else:
for k, row in stutter_df.iterrows():
bracket_test = stutter_df.loc[k, "UAS_Output_Bracketed_Notation"]
bracket_test = stutter_df.loc[k, brack_col]
if bracketed_stutter_id(bracket_test, allele, -1) == -1:
is_true, reads = True, stutter_df.loc[k, "Reads"]
return is_true, reads
Expand Down
Loading