Update terminology in README and CLI, remove amelogenin in annotate (#42

)
bioforensics · May 11, 2021 · 4eaab0a · 4eaab0a
1 parent 83a6170
commit 4eaab0a
Show file tree

Hide file tree

Showing 16 changed files with 27,777 additions and 33,503 deletions.
diff --git a/README.md b/README.md
@@ -1,8 +1,8 @@
 # lusSTR
 
-lusSTR is a tool written in Python to convert NGS sequence data of forensic STR loci to different annotation types for ease in downstream analyses.
+lusSTR is a tool written in Python to convert NGS sequence data of forensic STR loci to different sequence representations and allele designations for ease in downstream analyses.
 
-This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accomodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for annotation.
+This Python package has been written for use with either: (1) the 27 autosomal STR loci, 24 Y-chromosome STR loci and 7 X-chromosome STR loci from the Verogen ForenSeq panel, or (2) the 22 autosomal STR loci and 22 Y-chromosome loci from the Promega PowerSeq panel. The package accomodates either the Sample Details Report from the ForenSeq Universal Analysis Software (UAS) or STRait Razor output. If STRait Razor output is provided, sequences are filtered to the UAS sequence region for translation.
 
 lusSTR also processes SNP data from the Verogen ForenSeq panel. ForenSeq consists of 94 identity SNPs, 22 phenotype (hair/eye color) SNPs, 54 ancestry SNPs and 2 phenotype and ancestry SNPs. Identity SNP data is provided in the UAS Sample Details Report; phenotype and ancestry SNP data is provided in the UAS Phenotype Report. All SNP calls are also reported in the STRait Razor output.
 
@@ -71,16 +71,19 @@ The above command will output two tables which are used in the ```annotate``` co
 
 If using lusSTR version 0.4 or above, STRait Razor data **must** be produced using the STRait Razor config file released in January 2021 (ForenSeqv1.25.config and PowerSeqv2.1.config). These config files are available here: https://github.com/Ahhgust/STRaitRazor/tree/103ef68746f010add8f21266fa8bf8fb9f4a076e/.
 
-If using the output from STRait Razor, the files **must** be labeled as ```SampleID.txt``` (example: ```Sample0001.txt```) and **must** be compiled in a separate folder (labeled with the project ID). The user must specify the folder name for the ```format``` command as well as an output filename (all sample files will be compiled into one file):
+If using the output from STRait Razor, the files **must** be labeled as ```SampleID.txt``` (example: ```Sample0001.txt```) and can either be specified as a single file or as a folder of multiple STRait Razor output files (folder labeled with the project ID). The user must specify the file or folder name for the ```format``` command as well as an output filename (all sample files will be compiled into one file):
 ```
 lusstr format <input> -o <output>
 ```
 
-Example:
+Examples:
 
 ```
 lusstr format STRaitRazorOutputFolder/ -o STRaitRazor_test_file.csv
 ```
+```
+lusstr format A001.txt -o A001.csv
+```
 
 Again, sex loci can be included using the ```--include-sex``` flag.
 ```
@@ -89,7 +92,7 @@ lusstr format STRaitRazorOutputFolder/ -o STRaitRazor_test_file.csv --include-se
 With this, two tables will be produced: ```STRaitRazor_test_file.csv``` and ```STRaitRazor_test_file_sex_loci.csv```.
 
 
-### Annotation of STR loci sequences
+### Translation of STR loci sequences
 
 The ```annotate``` command produces a tab-delineated table with the following columns:
 *  Sample ID
@@ -98,15 +101,17 @@ The ```annotate``` command produces a tab-delineated table with the following co
 *  Locus
 *  UAS Output sequence: can be forward or reverse strand
 *  Forward strand sequence: will be same as UAS Output sequence for those loci reported on forward strand
-*  Traditional STR allele: common repeat unit based annotation
-*  Forward Strand Bracketed annotation: Bracketed annotation for forward strand sequence
-*  UAS Output Bracketed annotation: Bracketed annotation for the reported UAS sequence output (will be same for those loci which report the forward strand)
+*  RU allele: common length-based repeat unit (RU) allele designation
+*  Forward Strand Bracketed notation: Bracketed notation for forward strand sequence
+*  UAS Output Bracketed notation: Bracketed annotation for the reported UAS sequence output (will be same for those loci which report the forward strand)
 *  LUS: Longest uninterrupted stretch
-*  LUS+: annotation combining multiple annotations including traditional STR allele designation, LUS, secondary motif (if applicable) and tertiary motif (if applicable)
+*  LUS+: Notation combining multiple allele designations including RU, LUS, secondary motif (if applicable) and tertiary motif (if applicable)
 *  Reads: number of reads observed with the specified sequence
 
 If the ```--include-sex``` flag is included, a second table with the above columns for the sex chromosome loci will be outputted as well.
 
+**NOTE** on including the sex chromosome STR loci: in the ```annotate``` step, lusSTR requires two files for input: (1) the properly formatted file of autosomal STR loci produced from the ```format``` command (or a file with the appropriate format) with a label such as ```lusSTRinput.csv```, and (2) the properly formatted file of X- and Y-STR loci produced from the ```format``` command with the ```--include-sex``` flag (or a file with the appropriate format) labeled as ```lusSTRinput_sexloci.csv```. The file containing the X- and Y-STR loci *must* have the identical file name to the file containing the autosomal STRs but with ```_sexloci.csv``` (see above for precise examples). These two files are automatically created (and named appropriately) when using the ```--include-sex``` flag with the ```format``` command.
+
 For the ```annotate``` command, the following must be specified:
 *  Input filename
 *  Output filename
@@ -124,7 +129,7 @@ lusstr annotate UAS_test_file.csv -o UAS_final_table.txt --kit forenseq --uas
 ```
 
 If no ```--uas``` flag is provided, several additional processes occur with the ```annotate``` command:
-*  The full sequences are filtered to the UAS region before the annotation step. The number of bases to remove is determined based on the specified kit.
+*  The full sequences are filtered to the UAS region before the translation step. The number of bases to remove is determined based on the specified kit.
 *  Once the sequences are filtered to the UAS region, any duplicated sequences are removed and their reads are summed in with the remaining sequence ```Reads``` column. NOTE: This step can be skipped with the ```--nocombine``` flag.
 
 Further, a second table (labeled as ```*_flanks_anno.txt```) containing information related to the flanking sequences surrounding the UAS sequence region is also produced with the following columns:
@@ -135,9 +140,9 @@ Further, a second table (labeled as ```*_flanks_anno.txt```) containing informat
 *  Reads: number of reads observed for the specified sequence
 *  Length-based Allele
 *  Full Sequence
-*  5' Flanking Sequence Annotation
-*  UAS Region Sequence Annotation (same as column ```UAS Output Bracketed annotation``` in the main table)
-*  3' Flanking Sequence Annotation
+*  5' Flanking Sequence Bracketed Notation
+*  UAS Region Sequence Bracketed Notation (same as column ```UAS Output Bracketed Notation``` in the main table)
+*  3' Flanking Sequence Bracketed Notation
 *  Potential Issues (such as: Possible indel or partial sequence)
 
 The ```Potential Issues``` column in this report is to draw attention to potential problem sequences (due to perhaps an indel or partial sequence) and requires the attention of the user to further evaluate the sequence for it's authenticity.
@@ -152,7 +157,7 @@ If the ```--include-sex``` flag is included, as below:
 ```
 lusstr annotate STRaitRazor_test_file.csv -o STRaitRazor_powerseq_final.txt --kit powerseq --include-sex
 ```
- Two additional tables will be produced: (1) ```STRaitRazor_powerseq_final_sexloci.txt``` and (2) ```STRaitRazor_powerseq_final_sexloci_flanks_anno.txt``` for annotation of the sex chromosome loci and their flanking regions.
+ Two additional tables will be produced: (1) ```STRaitRazor_powerseq_final_sexloci.txt``` and (2) ```STRaitRazor_powerseq_final_sexloci_flanks_anno.txt``` for translation of the sex chromosome loci and their flanking regions.
 
  ## SNP Data Processing
 

diff --git a/lusSTR/annot.py b/lusSTR/annot.py
@@ -48,7 +48,7 @@ def format_table(input, uas=False, kit='forenseq'):
     '''
     Function to format final output table and the flanking report (if necessary).
     '''
-    data = pd.read_csv(input)
+    data = pd.read_csv(input, keep_default_na=False)
     data.iloc[:, 3] = data.iloc[:, 3].astype(str)
     list_of_lists = []
     flanks_list = []
@@ -69,6 +69,8 @@ def format_table(input, uas=False, kit='forenseq'):
             locus = 'PENTA E'
         if locus == 'DYS385A-B' or locus == 'DYS385':
             locus = 'DYS385A-B'
+        if locus == 'AMELOGENIN':
+            continue
         metadata = str_marker_data[locus]
         if kit == 'forenseq':
             remove_5p = metadata['Foren_5']
@@ -104,41 +106,71 @@ def format_table(input, uas=False, kit='forenseq'):
 
     columns = [
         'SampleID', 'Project', 'Analysis', 'Locus', 'UAS_Output_Sequence',
-        'Forward_Strand_Sequence', 'Traditional_STR_Allele', 'Forward_Strand_Bracketed_form',
-        'UAS_Output_Bracketed_Form', 'LUS', 'LUS_Plus', 'Reads'
+        'Forward_Strand_Sequence', 'RU_Allele', 'Forward_Strand_Bracketed_Notation',
+        'UAS_Output_Bracketed_Notation', 'LUS', 'LUS_Plus', 'Reads'
     ]
-    final_output = pd.DataFrame(list_of_lists, columns=columns)
+    if not list_of_lists:
+        final_output = pd.DataFrame(list_of_lists, columns=columns)
+    else:
+        final_output = sort_table(pd.DataFrame(list_of_lists, columns=columns))
     if not uas:
         flanks_columns = [
-            'SampleID', 'Project', 'Analysis', 'Locus', 'Reads', 'Length_Allele',
-            'Full_Sequence', '5_Flank_Anno', 'UAS_Region_Anno', '3_Flank_Anno',
-            'Potential_Issues'
+            'SampleID', 'Project', 'Analysis', 'Locus', 'Reads', 'RU_Allele',
+            'Full_Sequence', '5_Flank_Bracketed_Notation', 'UAS_Region_Bracketed_Notation',
+            '3_Flank_Bracketed_Notation', 'Potential_Issues'
         ]
-        final_flank_output = pd.DataFrame(flanks_list, columns=flanks_columns)
+        if not flanks_list:
+            final_flank_output = pd.DataFrame(flanks_list, columns=flanks_columns)
+        else:
+            final_flank_output = sort_table(pd.DataFrame(flanks_list, columns=flanks_columns))
     else:
         final_flank_output = ''
     return final_output, final_flank_output, columns
 
 
+def combine_reads(table, columns):
+    comb_table = table.groupby(columns[:-1], as_index=False)['Reads'].sum()
+    sorted = sort_table(comb_table)
+    return sorted
+
+
+def sort_table(table):
+    sorted_table = table.sort_values(
+                by=['SampleID', 'Project', 'Analysis', 'Locus', 'Reads', 'RU_Allele'],
+                ascending=False
+            )
+    return sorted_table
+
+
 def main(args):
     output_name = os.path.splitext(args.out)[0]
     input_name = os.path.splitext(args.input)[0]
     autosomal_final_table, autosomal_flank_table, columns = format_table(
         args.input, args.uas, args.kit
     )
     if args.sex:
-        sex_final_table, sex_flank_table, sex_columns = format_table(
+        sex_final_table, sex_flank_table, columns = format_table(
             f'{input_name}_sexloci.csv', args.uas, args.kit
         )
-        sex_final_table.to_csv(f'{output_name}_sexloci.txt', sep='\t', index=False)
         if not args.uas:
-            sex_flank_table.to_csv(f'{output_name}_sexloci_flanks_anno.txt', sep='\t', index=False)
+            sex_flank_table.to_csv(
+                f'{output_name}_sexloci_flanks_anno.txt', sep='\t', index=False
+            )
+            if args.combine:
+                if not sex_final_table.empty:
+                    sex_final_table = combine_reads(sex_final_table, columns)
+                sex_final_table.to_csv(f'{output_name}_sexloci.txt', sep='\t', index=False)
+            else:
+                sex_final_table.to_csv(
+                    f'{output_name}_sexloci_no_combined_reads.txt', index=False
+                )
+        else:
+            sex_final_table.to_csv(f'{output_name}_sexloci.txt', sep='\t', index=False)
     if not args.uas:
         autosomal_flank_table.to_csv(f'{output_name}_flanks_anno.txt', sep='\t', index=False)
         if args.combine:
-            autosomal_final_table = (
-                autosomal_final_table.groupby(columns[:-1], as_index=False)['Reads'].sum()
-            )
+            if not autosomal_final_table.empty:
+                autosomal_final_table = combine_reads(autosomal_final_table, columns)
             autosomal_final_table.to_csv(args.out, sep='\t', index=False)
         else:
             autosomal_final_table.to_csv(

diff --git a/lusSTR/cli.py b/lusSTR/cli.py
@@ -20,12 +20,13 @@ def format_subparser(subparsers):
     )
     cli.add_argument(
         'input',
-        help='Input is either a single file (UAS Sample Details Report, in .xlsx format) or a '
-        'directory of STRait Razor output files. If input is the UAS Sample Details Report '
-        '(in .xlsx format), use of the --uas flag is required. If STRait Razor output is '
-        'used, the name of the provided directory will be used as the Analysis ID in the '
-        'final annotation table. Output files within the directory should be named as such: '
-        'SampleID_STRaitRazor.txt (e.g. A001_STRaitRazor.txt).'
+        help='Input is either a UAS Sample Details Report (in .xlsx format) or a STRait Razor '
+        'output file (.txt format). Both single files and directories containing multiple UAS or '
+        'STRait Razor files are accepted. If input is the UAS Sample Details Report, use of the '
+        '--uas flag is required. If a directory of STRait Razor files is provided, the name of '
+        'the directory will be used as the Project and Analysis IDs in the final annotation '
+        'table. If a single file is provided, the Project and Analysis IDs will be NA. '
+        'STRaitRazor files should be named as the Sample ID, e.g. A001.txt, A002.txt, etc.'
     )
     cli.add_argument(
         '--uas', action='store_true',

diff --git a/lusSTR/format.py b/lusSTR/format.py
@@ -63,7 +63,7 @@ def uas_format(infile, sexloci=False):
     return auto_strs, sex_strs
 
 
-def strait_razor_concat(indir, sexloci=False):
+def strait_razor_concat(inpath, sexloci=False):
     '''Format a directory of STRait Razor output files for use with `lusSTR annotate`.'''
     locus_list = [
         'CSF1PO', 'D10S1248', 'D12S391', 'D13S317', 'D16S539', 'D17S1301', 'D18S51', 'D19S433',
@@ -80,33 +80,47 @@ def strait_razor_concat(indir, sexloci=False):
     ]
     auto_strs = pd.DataFrame()
     sex_strs = pd.DataFrame() if sexloci is True else None
-    analysisID = os.path.basename(indir.rstrip(os.sep))
-    files = glob.glob(os.path.join(indir, '*.txt'))
-    for filename in sorted(files):
-        name = filename.replace('.txt', '').split(os.sep)[-1]
-        table = pd.read_csv(
-            filename, sep='\t', header=None,
-            names=['Locus_allele', 'Length', 'Sequence', 'Forward_Reads', 'Reverse_Reads']
-        )
-        try:
-            table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
-        except ValueError:
-            print(
-                f'Error found with {filename}. Will bypass and continue. Please check file'
-                f' and rerun the command, if necessary.'
-            )
-            continue
-        table['Total_Reads'] = table['Forward_Reads'] + table['Reverse_Reads']
-        table['SampleID'] = name
-        table['Project'] = analysisID
-        table['Analysis'] = analysisID
-        table = table[['Locus', 'Total_Reads', 'Sequence', 'SampleID', 'Project', 'Analysis']]
+    if os.path.isdir(inpath):
+        analysisID = os.path.basename(inpath.rstrip(os.sep))
+        files = glob.glob(os.path.join(inpath, '*.txt'))
+        for filename in sorted(files):
+            try:
+                table = strait_razor_table(filename, analysisID, sexloci)
+            except ValueError:
+                print(
+                    f'Error found with {filename}. Will bypass and continue. Please check file'
+                    f' and rerun the command, if necessary.'
+                )
+                continue
+            auto_strs = auto_strs.append(table[table.Locus.isin(locus_list)])
+            if sexloci is True:
+                sex_strs = sex_strs.append(table[table.Locus.isin(sex_locus_list)])
+    else:
+        table = strait_razor_table(inpath, 'NA', sexloci)
         auto_strs = auto_strs.append(table[table.Locus.isin(locus_list)])
         if sexloci is True:
             sex_strs = sex_strs.append(table[table.Locus.isin(sex_locus_list)])
     return auto_strs, sex_strs
 
 
+def strait_razor_table(filename, analysisID, sexloci=False):
+    name = filename.replace('.txt', '').split(os.sep)[-1]
+    table = pd.read_csv(
+        filename, sep='\t', header=None,
+        names=['Locus_allele', 'Length', 'Sequence', 'Forward_Reads', 'Reverse_Reads']
+    )
+    try:
+        table[['Locus', 'Allele']] = table.Locus_allele.str.split(":", expand=True)
+    except ValueError:
+        table = []
+    table['Total_Reads'] = table['Forward_Reads'] + table['Reverse_Reads']
+    table['SampleID'] = name
+    table['Project'] = analysisID
+    table['Analysis'] = analysisID
+    table = table[['Locus', 'Total_Reads', 'Sequence', 'SampleID', 'Project', 'Analysis']]
+    return table
+
+
 def main(args):
     if args.uas:
         results, sex_results = uas_load(args.input, args.sex)

diff --git a/lusSTR/tests/data/2800M_full_anno_no_combined_reads.txt b/lusSTR/tests/data/2800M_full_anno_no_combined_reads.txt
@@ -1,4 +1,4 @@
-SampleID	Project	Analysis	Locus	UAS_Output_Sequence	Forward_Strand_Sequence	Traditional_STR_Allele	Forward_Strand_Bracketed_form	UAS_Output_Bracketed_Form	LUS	LUS_Plus	Reads
+SampleID	Project	Analysis	Locus	UAS_Output_Sequence	Forward_Strand_Sequence	RU_Allele	Forward_Strand_Bracketed_Notation	UAS_Output_Bracketed_Notation	LUS	LUS_Plus	Reads
 A01	NA	NA	CSF1PO	AGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAT	ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCT	12	[ATCT]12	[AGAT]12	12_12	12_12_0	100
 A01	NA	NA	D10S1248	GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA	GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA	13	[GGAA]13	[GGAA]13	13_13	13_13	100
 A01	NA	NA	D10S1248	GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA	GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA	15	[GGAA]15	[GGAA]15	15_15	15_15	100