bioforensics · standage · Jan 6, 2023 · Nov 19, 2022
diff --git a/.github/workflows/cibuild.yml b/.github/workflows/cibuild.yml
@@ -9,7 +9,7 @@ jobs:
     strategy:
       max-parallel: 4
       matrix:
-        python-version: [3.6, 3.7, 3.8]
+        python-version: [3.7, 3.8, 3.9]
 
     steps:
     - uses: actions/checkout@v1

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,8 +1,11 @@
 include versioneer.py
 include lusSTR/_version.py
+include lusSTR/filters.json
 include lusSTR/str_markers.json
 include lusSTR/snp_data.json
 include lusSTR/tests/data/*
 include lusSTR/tests/data/STRait_Razor_test_output/*
 include lusSTR/tests/data/UAS_bulk_input/*
 include lusSTR/tests/data/snps/*
+include lusSTR/tests/data/RU_stutter_test/*
+include lusSTR/tests/data/NGS_stutter_test/*
diff --git a/README.md b/README.md
@@ -4,6 +4,8 @@ lusSTR is a tool written in Python to convert NGS sequence data of forensic STR
 
 This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accommodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for translation.
 
+lusSTR can also filter sequences and identify stutter using the RU allele for autosomal loci, and create files for direct input into two probabilistic genotyping software packages: EuroForMix (EFM) and STRmix.
+
 lusSTR also processes SNP data from the Verogen ForenSeq panel. ForenSeq consists of 94 identity SNPs, 22 phenotype (hair/eye color) SNPs, 54 ancestry SNPs and 2 phenotype and ancestry SNPs. Identity SNP data is provided in the UAS Sample Details Report; phenotype and ancestry SNP data is provided in the UAS Phenotype Report. All SNP calls are also reported in the STRait Razor output.
 
 
@@ -203,6 +205,52 @@ lusstr snps UAS_files/ -o uas_output_all.txt --type all --uas
 lusstr snps STRait_Razor_output/ -o strait_razor_p.txt --type p
 ```
 
+## Filtering RU alleles and Creation of Files for Use in ProbGen Software
+
+**Currently, lusSTR is only set up to filter and identify stutter based on RU alleles for autosomal loci; future work will expand into the use of the LUS+ allele/sequence string as well as sex chromosome loci and SNPs.**
+
+The ```filter``` command provides the opportunity to filter sequences using thresholds such as:
+* Detection threshold (both static and dynamic)
+* Analytical threshold (both static and dynamic)
+* Same size threshold (dynamic)
+
+Custom static and dynamic thresholds for each locus are stored in the ```filters.json``` file. This file should be updated to utilize validated thresholds for individual labs.
+
+In addition, stutter alleles can be identified using the ```--info``` flag. This creates a separate file containing information about each allele, including an allele classification (```real allele```, ```stutter``` or ```noise```). Stutter alleles are classified as either ```-1 stutter```, ```-2 stutter```, or ```+1 stutter```. For these stutter alleles, the stuttering allele is reported along with the percent stutter (# of reads for that allele/# of reads for stuttering allele). In instances where a stutter allele could be multiple different types of stutter, all potential designations will be reported as such: ```-1 stutter/-2 stutter```, ```-1 stutter/+1 stutter```, or ```-2 stutter/+1 stutter```. No percent stutter is calculated for these alleles. If a sequence is identified as noise, the percent noise is calculated (# of reads for that sequence/total locus reads).
+
+Each locus is checked for more than 2 alleles (indicating a potential mixture) and for intralocus imbalance. If either is identified, a separate file (```Flagged_Loci.csv```) is created, containing the SampleID, Locus and either ```>2Alleles``` or ```IntraLocusImbalance```.
+
+When using STRmix data, the data type can be specified using the ```--data-type``` flag as either ```ce``` or ```ngs``` (default is ```ce```). If ```ngs``` is specified, the same size filter is applied but the stutter filter is not (the stutter filter is currently a work in progress for NGS data!). Further, the columns and column names in the output file differ based on the data type.
+
+Finally, output files are created for direct use in EuroForMix (EFM) or STRmix. If EFM is specified, a single file is created containing all samples in the input file (however, separate output files for each sample can be created with the ```--separate``` flag). If STRmix is specified, a directory containing files for each individual sample is created. The ```--profile-type``` flag allows for the creation of either a ```reference``` or ```evidence``` profile; EuroForMix and STRmix each require different formatting depending on the type of sample.
+
+### Usage
+```
+lusstr filter <input file> -o <output file/directory> --output-type <efm or strmix> --profile-type <evidence or reference> --data-type <ce or ngs> --info --no-filters --separate
+```
+The ```filter``` command requires the input of a ```.txt``` file produced by the ```lusstr annotate``` command.
+* ```-o/--out``` specifies the name of the output file (for EFM) or output directory (for STRmix).
+* ```--output-type``` specifies the type of output file created, either ```efm``` or ```strmix```. ```efm``` is the default.
+* ```--profile-type``` specifies the sample type, either ```evidence``` or ```reference```. ```evidence``` is the default.
+* ```--data-type``` specifies the type of data used, either ```ce``` or ```ngs```. ```ce``` is the default. Only applicable to STRmix data.
+* ```--info``` creates the allele information file, containing allele designations (e.g. stutter, noise or real allele) as well as stutter/noise percentages.
+* ```--no-filters``` skips all filters, so every allele present in the input file appears in the created output file(s).
+* ```--separate``` writes each sample to its own output file for EFM. STRmix output is always written as separate files.
+
+**Examples**:
+
+```
+lusstr filter experiment01.txt -o experiment01_efm.csv --output-type efm --info
+```
+
+```
+lusstr filter experiment01.txt -o STRmix_files/ --output-type strmix --profile-type reference --info
+```
+
+```
+lusstr filter experiment01.txt -o STRmix_files/ --output-type strmix --data-type ngs --info
+```
+
 ----
 
 lusSTR is still under development and any suggestions/issues found are welcome!
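
Review note: the stutter and noise percentages described in the README addition above reduce to simple read-count ratios, and the combined designations (for example ```-1 stutter/-2 stutter```) follow from checking each candidate parent allele. The sketch below illustrates that arithmetic only. The function names, the integer-only RU alleles (microvariants such as 9.3 are ignored) and the sample read counts are all assumptions for illustration; none of this is taken from `lusSTR/filter.py`.

```python
def percent_stutter(allele_reads, parent_reads):
    """Ratio of reads for the stutter allele to reads for the stuttering (parent) allele."""
    if parent_reads <= 0:
        raise ValueError("parent allele must have at least one read")
    return allele_reads / parent_reads


def percent_noise(sequence_reads, total_locus_reads):
    """Ratio of reads for a noise sequence to the total reads at the locus."""
    if total_locus_reads <= 0:
        raise ValueError("locus must have at least one read")
    return sequence_reads / total_locus_reads


def stutter_designations(allele, real_alleles):
    """Return every stutter type an integer RU allele could be, or None.

    Hypothetical helper: an allele is a candidate "-1 stutter" if a real allele
    exists one repeat unit above it, "-2 stutter" if one exists two units above,
    and "+1 stutter" if one exists one unit below.
    """
    offsets = {-1: "-1 stutter", -2: "-2 stutter", 1: "+1 stutter"}
    found = [
        label
        for offset, label in offsets.items()
        if allele - offset in real_alleles  # parent allele = allele - offset
    ]
    return "/".join(found) if found else None


# Hypothetical read counts: allele 11 has 50 reads, its parent allele 12 has 1000.
print(percent_stutter(50, 1000))
print(stutter_designations(11, {12, 13}))
```

In this sketch an allele with more than one candidate parent gets a combined label, which matches the README's statement that no single percent stutter is reported for ambiguous alleles.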
diff --git a/lusSTR/__init__.py b/lusSTR/__init__.py
@@ -11,6 +11,8 @@
 from lusSTR import marker
 from lusSTR import repeat
 from lusSTR import format
+from lusSTR import snps
+from lusSTR import filter
 from lusSTR import cli
 from ._version import get_versions
 __version__ = get_versions()['version']

diff --git a/lusSTR/cli.py b/lusSTR/cli.py
@@ -9,7 +9,7 @@
 
 import argparse
 import lusSTR
-from . import format, annot, snps
+from . import format, annot, snps, filter
 
 
 def format_subparser(subparsers):
@@ -105,16 +105,60 @@ def snps_subparser(subparsers):
     )
 
 
+def filter_subparser(subparsers):
+    cli = subparsers.add_parser('filter')
+    cli.add_argument(
+        'input',
+        help='Input is a single lusSTR output file (.txt format)'
+    )
+    cli.add_argument(
+        '--separate', action='store_true',
+        help='Used to create separate final output files for each Sample. If not used, a single '
+        'file containing all samples will be created.'
+    )
+    cli.add_argument(
+        '--info', action='store_true',
+        help='Use to create a text document containing additional information on filtered '
+        'sequences and stutter.'
+    )
+    cli.add_argument(
+        '--output-type', dest='output', choices=['efm', 'strmix'], default='efm',
+        help='Choose the file format of the output file, either "efm" or "strmix". '
+        'Default is efm.'
+    )
+    cli.add_argument(
+        '--no-filters', dest='nofilters', action='store_true',
+        help='Used to skip all filtering steps. All input alleles will be included in the output.'
+    )
+    cli.add_argument(
+        '--out', '-o', metavar='FILE',
+        help='Name of output file containing all samples for EFM or name/path of directory for '
+        'STRmix. If separate files are specified for EFM, the sample ID will be used as the '
+        'filename. Output files are in CSV format.'
+    )
+    cli.add_argument(
+        '--profile-type', dest='profile', choices=['evidence', 'reference'], default='evidence',
+        help='Choose the type of profile, either evidence or reference. Default is evidence.'
+    )
+    cli.add_argument(
+        '--data-type', dest='data', choices=['ngs', 'ce'], default='ce',
+        help='Choose the type of data, either ngs or ce. Default is ce. '
+        '**This is only applicable to STRmix evidence data.**'
+    )
+
+
 mains = {
     'format': lusSTR.format.main,
     'annotate': lusSTR.annot.main,
     'snps': lusSTR.snps.main,
+    'filter': lusSTR.filter.main
 }
 
 subparser_funcs = {
     'format': format_subparser,
     'annotate': annot_subparser,
     'snps': snps_subparser,
+    'filter': filter_subparser
 }
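
Review note: several of the new flags use a `dest` that differs from the flag name (`--output-type` becomes `output`, `--no-filters` becomes `nofilters`, and so on), so it helps to see what the parsed namespace looks like. The sketch below rebuilds only the arguments from `filter_subparser` on a standalone `argparse` parser. The `build_parser` helper and the `dest='subcmd'` name on the subparsers are assumptions for illustration; help strings are omitted.

```python
import argparse


def build_parser():
    """Standalone copy of the filter arguments from the diff (help text omitted)."""
    parser = argparse.ArgumentParser(prog='lusstr')
    subparsers = parser.add_subparsers(dest='subcmd')  # dest name is an assumption
    cli = subparsers.add_parser('filter')
    cli.add_argument('input')
    cli.add_argument('--separate', action='store_true')
    cli.add_argument('--info', action='store_true')
    cli.add_argument(
        '--output-type', dest='output', choices=['efm', 'strmix'], default='efm'
    )
    cli.add_argument('--no-filters', dest='nofilters', action='store_true')
    cli.add_argument('--out', '-o', metavar='FILE')
    cli.add_argument(
        '--profile-type', dest='profile', choices=['evidence', 'reference'],
        default='evidence'
    )
    cli.add_argument('--data-type', dest='data', choices=['ngs', 'ce'], default='ce')
    return parser


# Third example command from the README, split into argv form.
args = build_parser().parse_args([
    'filter', 'experiment01.txt', '-o', 'STRmix_files/',
    '--output-type', 'strmix', '--data-type', 'ngs', '--info',
])
print(args.output, args.profile, args.data, args.info, args.nofilters, args.out)
# prints: strmix evidence ngs True False STRmix_files/
```

Note that `--out` has no default, so `args.out` is `None` when the flag is omitted; code consuming the namespace would need to handle that case.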