Filtering RU alleles #49

Merged
merged 79 commits into from
Jan 6, 2023
9f9b05e
added filters for RU allele processing
Nov 19, 2022
48d28a9
fixed efm_output function
Nov 19, 2022
310406e
fixed error in setup.py
Nov 19, 2022
6b2a8d7
fixed another error in setup.py
Nov 19, 2022
fffe040
added output file name
Nov 20, 2022
3f27d8d
added sequence info file
Nov 20, 2022
1457d4d
fixed error in generating info file
Nov 20, 2022
97dec9c
removed allele des
Nov 20, 2022
f79fc2d
added test for thresholds and fixed bug in thresholds function
Nov 21, 2022
3021b6e
Merge branch 'master' into filtering
Nov 21, 2022
9015bc0
added tests for -2/-1/+1 stutter functions
Nov 21, 2022
28c574d
added directory option for strmix; tests for strmix and efm outputs
Nov 22, 2022
ede7e03
added file for tests
Nov 22, 2022
e5ef7ba
files for tests
Nov 22, 2022
d42ef68
keep forgetting test files...
Nov 22, 2022
70b18ac
added tests for generating info file
Nov 23, 2022
4f749d2
file for tests...
Nov 23, 2022
70ebc9f
added test for stdout
Nov 23, 2022
7c9514b
changed col names for STRmix output
Nov 29, 2022
b78af50
added flagged loci file for num of alleles and intralocus imbalance
Nov 30, 2022
eb981c0
updated tests with same file
Nov 30, 2022
34573aa
added test file and fixed error in test
Nov 30, 2022
2646ece
removing old files
Nov 30, 2022
45340f0
added correct files
Nov 30, 2022
eb555ad
fixed rounding error with stutter perc
Nov 30, 2022
cc04eb0
updated filters.json file
Nov 30, 2022
da0d8e4
updated json and test file
Nov 30, 2022
d7849ef
test troubleshooting
Nov 30, 2022
3849c0e
hopefully fixing bug
Nov 30, 2022
3ef4e22
reverting to previous test code
Nov 30, 2022
f359356
added reference and evidence output types
Dec 1, 2022
32a5c13
added tests for reference profiles
Dec 1, 2022
30be9dd
updated setup.py
Dec 1, 2022
3bcd970
fixing bug in ci
Dec 1, 2022
7a13d04
fix ci build
Dec 1, 2022
3907304
fixing ci
Dec 1, 2022
cbe36c4
fixed error in setup.py
Dec 1, 2022
e621146
trying once again to fix ci
Dec 1, 2022
bc515a7
updated README
Dec 1, 2022
6f8fe17
updated cibuild
Dec 5, 2022
6602ace
updated manifest
Dec 5, 2022
b3a7e35
updated README
Dec 5, 2022
8b00bb6
remove copy of filters.py
Dec 5, 2022
ac99982
fixed bug in -2 stutter reporting; updated STRmix output to remove noise
Dec 6, 2022
6e69928
updated tests
Dec 6, 2022
493bb4b
fixed locus names and remove amel
Dec 7, 2022
cc3ee8a
updated tests
Dec 7, 2022
9137a8e
added ngs data option; same size filter
Dec 8, 2022
8e7f47c
fixed bug in same size filter
Dec 8, 2022
cf07924
fixed bug for same size filter
Dec 8, 2022
fe1afd8
changed locus names in ce file
Dec 8, 2022
4046c26
updated col names for ce files
Dec 9, 2022
dfa0594
updated cli
Dec 9, 2022
b43c912
added test for ngs, updated tests and files with newest code
Dec 9, 2022
bff7959
updated filter test
Dec 9, 2022
e0a984b
debugging test
Dec 9, 2022
e281234
updated ce strmix code to output allele as float64 object
Dec 9, 2022
19a438f
updated filter to convert NGS output to float
Dec 9, 2022
fbf57dc
updated reads type
Dec 9, 2022
49844a3
reverting to original test code
Dec 9, 2022
f22c568
changed data type of columns; updated test files
Dec 12, 2022
0e051e2
updated reference code to duplicate homozygous alleles
Dec 13, 2022
a00b0e4
updated manifest and setup.py
Dec 13, 2022
d54f877
updated README
Dec 14, 2022
d6cd992
refactor efm code
Dec 14, 2022
796dc69
update test doc
Dec 14, 2022
8c8a3fe
added flag for D7 microvariants
Dec 14, 2022
2806751
added test for flagging D7 microvariant
Dec 14, 2022
62eb4df
fixed bug with calling stutter for microvariants
Dec 15, 2022
51223a5
added test for no filters
Dec 15, 2022
d69ec36
clean up code
Dec 15, 2022
905fa76
cleaning up code
Dec 15, 2022
d255b8e
added check for specifying reference and ngs data
Dec 16, 2022
0e0ee17
added test for error message
Dec 16, 2022
2228b4f
switched labels
Dec 16, 2022
2dbbdf5
updated code for switching slope/intercept
Dec 19, 2022
5710a42
refactor code
Dec 22, 2022
35538da
clean up code
Jan 6, 2023
96e9279
more code refactoring
Jan 6, 2023
2 changes: 1 addition & 1 deletion .github/workflows/cibuild.yml
@@ -9,7 +9,7 @@ jobs:
    strategy:
      max-parallel: 4
      matrix:
-       python-version: [3.6, 3.7, 3.8]
+       python-version: [3.7, 3.8, 3.9]

    steps:
    - uses: actions/checkout@v1
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -1,8 +1,12 @@
include versioneer.py
include lusSTR/_version.py
include lusSTR/filters.json
include lusSTR/str_markers.json
include lusSTR/snp_data.json
include lusSTR/filters.json
include lusSTR/tests/data/*
include lusSTR/tests/data/STRait_Razor_test_output/*
include lusSTR/tests/data/UAS_bulk_input/*
include lusSTR/tests/data/snps/*
include lusSTR/tests/data/RU_stutter_test/*
include lusSTR/tests/data/NGS_stutter_test/*
48 changes: 48 additions & 0 deletions README.md
@@ -4,6 +4,8 @@ lusSTR is a tool written in Python to convert NGS sequence data of forensic STR

This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accommodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for translation.

lusSTR can perform filtering and stutter identification using the RU allele for autosomal loci and create files for direct input into two probabilistic genotyping software packages: EuroForMix (EFM) and STRmix.

lusSTR also processes SNP data from the Verogen ForenSeq panel. ForenSeq consists of 94 identity SNPs, 22 phenotype (hair/eye color) SNPs, 54 ancestry SNPs and 2 phenotype and ancestry SNPs. Identity SNP data is provided in the UAS Sample Details Report; phenotype and ancestry SNP data is provided in the UAS Phenotype Report. All SNP calls are also reported in the STRait Razor output.


@@ -203,6 +205,52 @@ lusstr snps UAS_files/ -o uas_output_all.txt --type all --uas
lusstr snps STRait_Razor_output/ -o strait_razor_p.txt --type p
```

## Filtering RU alleles and Creation of Files for Use in ProbGen Software

**Currently, lusSTR is only set up to filter and identify stutter based on RU alleles for autosomal loci; future work will expand into the use of the LUS+ allele/sequence string as well as for sex chromosome loci and SNPs.**

The ```filter``` command provides the opportunity to filter sequences using thresholds such as:
* Detection threshold (both static and dynamic)
* Analytical threshold (both static and dynamic)
* Same size threshold (dynamic)

Custom static and dynamic thresholds for each locus are stored in the ```filters.json``` file. This file should be updated to utilize validated thresholds for individual labs.
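As a rough illustration of how a static and a dynamic threshold might be combined for a single locus, consider the sketch below. The function name `passes_threshold`, its parameters, and the reading of a dynamic threshold as a fraction of total locus reads are assumptions made for this example; they do not describe lusSTR's actual implementation or the layout of ```filters.json```.

```python
def passes_threshold(allele_reads, locus_reads, static_min, dynamic_frac):
    """Return True if an allele meets both a static and a dynamic threshold.

    Assumption: a static threshold is an absolute read count, and a dynamic
    threshold is a fraction of the total reads observed at the locus.
    """
    # The stricter of the two requirements decides the outcome.
    required = max(static_min, dynamic_frac * locus_reads)
    return allele_reads >= required


# Hypothetical values: 1000 total reads at the locus, a static minimum of
# 10 reads and a dynamic threshold of 1.5% of locus reads (about 15 reads).
print(passes_threshold(20, 1000, 10, 0.015))
print(passes_threshold(12, 1000, 10, 0.015))
```

In this sketch the first allele clears both requirements, while the second clears the static minimum but not the dynamic one.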

In addition, stutter alleles can be identified using the ```--info``` flag. This creates a separate file containing information about each allele, including an allele classification (```real allele```, ```stutter``` or ```noise```). Stutter alleles are classified as ```-1 stutter```, ```-2 stutter``` or ```+1 stutter```. For these stutter alleles, the stuttering (parent) allele is reported along with the percent stutter (# of reads for the stutter allele/# of reads for the stuttering allele). When a stutter allele could be more than one type of stutter, all potential designations are reported, e.g. ```-1 stutter/-2 stutter```, ```-1 stutter/+1 stutter``` or ```-2 stutter/+1 stutter```; no percent stutter is calculated for these alleles. If a sequence is identified as noise, the percent noise is calculated (# of reads for that sequence/total locus reads).
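The two ratios described above can be sketched as follows. This is a minimal illustration only: the helper names (`stutter_label`, `percent_stutter`, `percent_noise`) are hypothetical, the label lookup assumes whole-number RU alleles (microvariants are ignored), and the values are returned as fractions rather than percentages.

```python
# Hypothetical mapping from the RU difference to the labels used in the text.
STUTTER_LABELS = {-1: "-1 stutter", -2: "-2 stutter", 1: "+1 stutter"}


def stutter_label(allele, parent):
    """Label an allele by its RU distance from a candidate parent allele.

    Returns None when the distance is not a recognized stutter position.
    """
    return STUTTER_LABELS.get(allele - parent)


def percent_stutter(stutter_reads, parent_reads):
    """Reads for the stutter allele divided by reads for the stuttering allele."""
    return round(stutter_reads / parent_reads, 3)


def percent_noise(sequence_reads, locus_reads):
    """Reads for the noise sequence divided by total reads at the locus."""
    return round(sequence_reads / locus_reads, 3)


# Made-up read counts: allele 12 with 50 reads next to allele 13 with 1000 reads.
print(stutter_label(12, 13), percent_stutter(50, 1000))
print(percent_noise(10, 2000))
```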

Each locus is checked for more than two alleles (indicating a potential mixture) and for intralocus imbalance. If either is identified, a separate file (```Flagged_Loci.csv```) is created, containing the SampleID, Locus and either ```>2Alleles``` or ```IntraLocusImbalance```.
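A check of this kind might look like the sketch below. The function name `flag_locus`, the 0.5 imbalance cutoff, and the decision to test imbalance only when exactly two alleles are present are all assumptions for illustration; the README does not state how lusSTR defines intralocus imbalance.

```python
def flag_locus(reads_by_allele, imbalance_min=0.5):
    """Return the flags raised for one locus.

    reads_by_allele maps each allele to its read count. The imbalance_min
    cutoff (lower count / higher count) is a hypothetical value.
    """
    flags = []
    if len(reads_by_allele) > 2:
        flags.append(">2Alleles")
    elif len(reads_by_allele) == 2:
        low, high = sorted(reads_by_allele.values())
        if low / high < imbalance_min:
            flags.append("IntraLocusImbalance")
    return flags


# Made-up read counts for three loci.
print(flag_locus({"12": 500, "13": 480}))
print(flag_locus({"12": 500, "13": 100}))
print(flag_locus({"12": 500, "13": 400, "14": 300}))
```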

When creating STRmix output, the data type can be specified with the ```--data-type``` flag as either ```ce``` or ```ngs``` (default is ```ce```). If ```ngs``` is specified, the same size filter is applied but the stutter filter is not (the stutter filter for NGS data is still a work in progress). The columns and column names in the output file also differ based on the data type.

Finally, output files are created for direct use in EuroForMix (EFM) or STRmix. If EFM is specified, a single file is created containing all samples in the input file (however, separate output files for each sample can be created with the ```--separate``` flag). If STRmix is specified, a directory containing files for each individual sample is created. The ```--profile-type``` flag allows for the creation of either a ```reference``` or ```evidence``` profile. Both EuroForMix and STRmix require different formatting depending on the type of sample.
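According to the discussion and commit messages in this pull request, reference output reports only two alleles per locus, omits read counts, and duplicates the allele for homozygous loci. The sketch below illustrates that idea; the function name `reference_alleles` and the error handling are hypothetical and are not taken from lusSTR.

```python
def reference_alleles(alleles):
    """Return exactly two alleles for a reference profile at one locus.

    A homozygous locus (one observed allele) is reported twice; read counts
    are deliberately not part of the result.
    """
    if len(alleles) == 1:
        return [alleles[0], alleles[0]]
    if len(alleles) == 2:
        return list(alleles)
    raise ValueError("a reference profile is expected to have 1 or 2 alleles per locus")


print(reference_alleles([12]))
print(reference_alleles([12, 14]))
```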

### Usage
```
lusstr filter <input file> -o <output file/directory> --output-type <efm or strmix> --profile-type <evidence or reference> --data-type <ce or ngs> --info --no-filters --separate
```
The ```filter``` command requires the input of a ```.txt``` file produced by the ```lusstr annot``` command.
* The ```-o/--out``` flag specifies the name of the output file (for EFM) or output directory (for STRmix).
* ```--output-type``` specifies the type of output file created, either ```efm``` or ```strmix```. ```efm``` is the default.
* ```--profile-type``` specifies the sample type, either ```evidence``` or ```reference```. ```evidence``` is the default.
* ```--data-type``` specifies the type of data used, either ```ce``` or ```ngs```. ```ce``` is the default. Only applicable to STRmix data.
* ```--info``` creates the allele information file, containing allele designations (e.g. stutter, noise or real allele) as well as stutter/noise percentages.
* The ```--no-filters``` flag skips all filters, so all alleles present in the input file will be in the created output file(s).
* The ```--separate``` flag separates samples into individual output files for EFM. STRmix creates separate files by default.

**Examples**:

```
lusstr filter experiment01.txt -o experiment01_efm.csv --output-type efm --info
```

```
lusstr filter experiment01.txt -o STRmix_files/ --output-type strmix --profile-type reference --info
```

```
lusstr filter experiment01.txt -o STRmix_files/ --output-type strmix --data-type ngs --info
```

----

lusSTR is still under development and any suggestions/issues found are welcome!
2 changes: 2 additions & 0 deletions lusSTR/__init__.py
@@ -11,6 +11,8 @@
from lusSTR import marker
from lusSTR import repeat
from lusSTR import format
from lusSTR import snps
from lusSTR import filter
from lusSTR import cli
from ._version import get_versions
__version__ = get_versions()['version']
46 changes: 45 additions & 1 deletion lusSTR/cli.py
@@ -9,7 +9,7 @@

import argparse
import lusSTR
-from . import format, annot, snps
+from . import format, annot, snps, filter


def format_subparser(subparsers):
@@ -105,16 +105,60 @@ def snps_subparser(subparsers):
)


def filter_subparser(subparsers):
    cli = subparsers.add_parser('filter')
    cli.add_argument(
        'input',
        help='Input is a single lusSTR output file (.txt format)'
    )
    cli.add_argument(
        '--separate', action='store_true',
        help='Used to create separate final output files for each Sample. If not used, a single '
             'file containing all samples will be created.'
    )
    cli.add_argument(
        '--info', action='store_true',
        help='Use to create a text document containing additional information on filtered '
             'sequences and stutter.'
    )
    cli.add_argument(
        '--output-type', dest='output', choices=['efm', 'strmix'], default='efm',
        help='Choose the file format of the output file, either "efm" or "strmix". '
             'Default is efm.'
    )
    cli.add_argument(
        '--no-filters', dest='nofilters', action='store_true',
        help='Used to skip all filtering steps. All input alleles will be included in the output.'
    )
    cli.add_argument(
        '--out', '-o', metavar='FILE',
        help='Name of output file containing all samples for EFM or name/path of directory for '
             'STRmix. If separate files are specified for EFM, the sample ID will be used as the '
             'filename. Output files are in CSV format.'
    )
    cli.add_argument(
        '--profile-type', dest='profile', choices=['evidence', 'reference'], default='evidence',
        help='Choose the type of profile, either evidence or reference. Default is evidence.'
    )
Comment on lines +139 to +142
Member

Leaving this question here now, in case I don't find the answer while looking at the rest of the code: how are evidence and reference profiles treated differently? Anything in addition to assuming a single diploid contributor for references?

Contributor Author

The output format (for both STRmix and EFM) is the only thing that differs (references only report two alleles and do not report the # of reads).

    cli.add_argument(
        '--data-type', dest='data', choices=['ngs', 'ce'], default='ce',
        # Trailing space keeps the two adjacent string literals from running together.
        help='Choose the type of data, either ngs or ce. Default is ce. '
             '**This is only applicable to STRmix evidence data.**'
    )


mains = {
    'format': lusSTR.format.main,
    'annotate': lusSTR.annot.main,
    'snps': lusSTR.snps.main,
    'filter': lusSTR.filter.main
}

subparser_funcs = {
    'format': format_subparser,
    'annotate': annot_subparser,
    'snps': snps_subparser,
    'filter': filter_subparser
}
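
The diff registers `filter` in both dictionaries but does not show how they are consumed. One plausible wiring, sketched here with stand-in functions (`filter_main`, `get_parser` and `main` are hypothetical names for illustration, not taken from lusSTR), is to let each entry of `subparser_funcs` attach its subcommand and then look up the handler in `mains`:

```python
import argparse


def filter_subparser(subparsers):
    # Reduced stand-in for the real subparser: one positional, one option.
    cli = subparsers.add_parser("filter")
    cli.add_argument("input")
    cli.add_argument("--output-type", dest="output",
                     choices=["efm", "strmix"], default="efm")


def filter_main(args):
    # Stand-in handler; the real one would run the filtering pipeline.
    return f"filtering {args.input} for {args.output}"


mains = {"filter": filter_main}
subparser_funcs = {"filter": filter_subparser}


def get_parser():
    parser = argparse.ArgumentParser(prog="lusstr")
    subparsers = parser.add_subparsers(dest="subcmd", required=True)
    for func in subparser_funcs.values():
        func(subparsers)  # each function registers its own subcommand
    return parser


def main(argv=None):
    args = get_parser().parse_args(argv)
    return mains[args.subcmd](args)  # dispatch on the chosen subcommand


print(main(["filter", "experiment01.txt", "--output-type", "strmix"]))
```

Keeping the two dictionaries keyed by the same subcommand name means adding a command only requires one new entry in each.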

