forked from ridgelab/SA-SSR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
304 lines (224 loc) · 12.7 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
README
======
Table of Contents
-----------------
I. Introduction
II. Installation Instructions
III. Usage Instructions and Examples
IV. Funding and Acknowledgements
V. Contact
I. Introduction
---------------
SA-SSR is a software tool developed to find Simple Sequence Repeats (SSRs) in
a sequence (presumably of DNA or RNA). SSRs are sometimes referred to as Short
Tandem Repeats (STRs) or microsatellites. SSRs are genetic markers with several
interesting and meaningful biological implications. For example, SSRs can play
significant roles in species identification and in genome alignment against a
reference and species identification.
Many software tools exist for this purpose, but vary widely in utility. Some
key features of our tool are as follows:
- Fast run time (linear, O(n), time complexity)
- Memory efficient (linear, O(n), space complexity)
- Finds all perfect repeats
- Simple command-line interface, convenient for scripting and when running
on High-Performance Computing (HPC) systems (note: no GUI provided)
- Easily parsed, tab-delimited output
- Runs on Linux (not Windows or Mac OS X)
See our paper in Bioinformatics for further information:
http://bioinformatics.oxfordjournals.org/content/early/2016/05/11/bioinformatics.btw298
II. Installation Instructions
-----------------------------
To compile SA-SSR, simply type `make'. The binary will be in the `bin'
directory. Your compiler must support C++11.
To compile and install, type `make' followed by `make install'. The binary
will be in both the `bin' and `/usr/local/bin' directories (to change this,
change the `PREFIX' variable in `Makefile').
To uninstall, type `make clean'.
See `INSTALL' for further instructions.
III. Usage Instructions and Examples
-------------------------------------
Please run the software with the `--help' option for complete usage instructions
(i.e., type `sa-ssr -h' or `sa-ssr --help'). The format of the input and
output files is described below; followed by usage examples.
Input File Format:
Description:
The input file must be in Fasta format. The sequences may be on a single
line or multiple lines. The header and sequence lines must not contain
any leading or trailing whitespace. The header line must not contain any
tabs. The sequence lines must not contain whitespace between
nucleotides. Mixed-case nucleotides are acceptable; they will be
replaced with uppercase nucleotides for finding SSRs.
Good Example:
>sequence 1
AGCGTGTCGTGTACACGTGTACGTACGTACGATCGATGCTACGTAGCATCGATCGACGTATCGTATCGATC
CACGTGTACGTACGTACGATCGATGCTACGTAGCATCGATCGACGTATCGTATCGATCAGCGTGTCGTGTA
.
.
.
>sequence 2
cgtacgatcgatgctacgtagcatcgatcgacgtatcgtatcgatcagcgtgtcgtgtacacgtgtacgta
gatcgacgtatcgtatcgatcagcgtgtcgtgtacacgtgtacgtacgtacgatcgatgctacgtagcatc
.
.
.
>sequence 3
taTCgATCAGCGtGTCGTGTAcACGTGTACGTAcgtaCGAtCgATGCTACGTagcatCGATCGACGTATCG
cgtacgtacgATCGATGCTACGTaGCATCGaTCGaCGTAtCGTAtcgatcaGCGTGTCGTGTAcacgtgta
.
.
.
Output File Format:
Description:
The output file is tab-delimited. Each row is a separate record. By
default, if no SSRs are found in a fasta sequence, no output record
will be present in the output file for that fasta sequence. When SSRs
are found, one output record will be present in the output file for
each SSR in that sequence. Each column contains a separate piece of
information for the output record. The default columns (in order) are:
Sequence_Name, SSR, Repeats, Position (zero-based)
By using the correct command-line options, an additional column may be
added between `Sequence_Name' and `SSR' containing the entire sequence
associated with the `Sequence_Name'. The columns would then become:
Sequence_Name, Sequence, SSR, Repeats, Position (zero-based)
Each of the columns are described below:
Sequence_Name: The entire header line from the fasta file (excluding the
leading `>' character) from which the SSR in the given
output record is found.
Sequence: The entire sequence from the fasta file in which the
SSR in the given output record is found. The sequence
will be uppercase. If taken from a multi-line sequence
in the fasta file, it will be joined into one continuous
sequence. Note, this column will only be added if
requested by command-line option.
SSR: The repeating unit of an SSR. For example, `ACACAC' is
an SSR with a repeating unit of `AC'.
Repeats: The number of times the repeating unit of an SSR repeats.
For example, the repeating unit of `AC' from SSR `ACACAC'
repeats 3 times.
Position: The zero-based position in the original fasta sequence
where the SSR begins. For example, `ACACAC' begins at
position 2 in the sequence `TGACACACGT'.
If desired, the output file may also contain an output record for the
fasta sequences where no SSRs were found. Consult the command-line
options to see how to do this.
As a simple illustration of the output file, consider the following
input file and its respective output file:
Input:
>seq 1
TGACACACGT
>seq 2
acgtg
tgtca
cagtc
Output (formatted here for readability):
Sequence_Name SSR Repeats Position (zero-based)
seq 1 AC 3 2
seq 2 GT 3 2
seq 2 CA 2 8
Usage Examples:
Basic Usage Examples:
Example 1:
sa-ssr input.fasta output.tsv
Run SA-SSR on fasta sequences in `input.fasta' and write results to
`output.tsv'. This will use the default parameters; run the software
with `--help' to see the defaults and check if they meet your needs.
Example 2:
sa-ssr -l 50 -L 200 input.fasta output.tsv
Run SA-SSR on fasta sequences in `input.fasta' that are between
50 and 200 base pairs long; i.e., do not search for SSRs in fasta
sequences shorter than 50 or longer than 200 bps. All results
will be in `output.tsv'.
Example 3:
sa-ssr -m 2 -M 9 -r 3 -R 20 input.fasta output.tsv
Run SA-SSR on fasta sequences in `input.fasta' and write results to
`output.tsv'. Only SSRs meeting the parameters provided will be
included. The meaning of each option is as follows:
-m, -M The min (-m) and max (-M) length of the repeating unit.
As examples: `AAAAAA...' and `ACGCAGTTGCACGCAGTTGC...'
would not make the cutoff because `A' is shorter than
2 nucleotides (-m 2) and `ACGCAGTTGC' is longer than 9
nucleotides (-M 9). However, `ACACAC...' and
`AATCCTGGTAATCCTGGT...' would be included.
-r, -R The min (-r) and max (-R) times the repeating unit
repeats. As examples: `ACAC' and `ACACACACACACACACACACAC
ACACACACACACACACACAC' (21 `AC' units) would not make the
cutoff. However, `ACACAC' and `ACACACACACACACACACACACACAC
ACACACACACACAC' (20 `AC' units) would be included.
Example 4:
sa-ssr -t 4 input.fasta output.tsv
Run SA-SSR on fasta sequences in `input.fasta' and write results to
`output.tsv'. Use 4 threads of execution, instead of the default 1.
Example 5:
sa-ssr -l 50 -L 200 -m 2 -M 9 -r 3 -R 20 -t 4 input.fasta output.tsv
Run SA-SSR on fasta sequences in `input.fasta' and write results to
`output.tsv'. This is a combination of examples 2, 3, and 4.
Extended Example:
The data for this extended example can be found in the `examples'
directory. The input fasta file is `input.fasta'. The output files are
`output_default.tsv' and `output.tsv'. The data and example are
fictitious, but was modified from real sequences. The numbers used for
min/max sequence length or any other parameter may not be biologically
realistic; however, they do conceptually represent a realistic
situation.
input.fasta: Each fasta sequence is a contig generated from a de novo
assembly of whole exome sequencing reads. The file has 16
fasta sequences.
output*.tsv: The output files from running SA-SSR as shown below.
One could just run SA-SSR with the defaults:
sa-ssr input.fasta output_default.tsv
However, the defaults may not be the best parameters for your data.
Below is a reasonable command to use with this example data. Each
alteration from the default is explained and justified based on the
data in this example.
sa-ssr -l 70 -L 300 -m 2 -M 9 -r 2 -t 4 -f input.fasta output.tsv
-l 70 Do not search for SSRs in fasta sequences less than 70 bps in
length. For sake of the example, we assume based on our reads
and the assembler's parameters and intricacies that the
resulting contigs should be equal to or longer than 70 bps.
Anything shorter would be erroneous and not worth searching for
SSRs.
-L 300 Do not search for SSRs in fasta sequences more than 300 bps in
length. For sake of the example, we assume based on our reads
and the assembler's parameters and intricacies that the
resulting contigs should be equal to or shorter than 300 bps.
Anything longer would be erroneous and not worth searching for
SSRs.
-m 2 Do not report SSRs where the base unit is less than 2 bps
in length. This decision must be made based on your research
question(s) and what you deem biologically interesting.
-M 9 Do not report SSRs where the base unit is more than 9 bps
in length. This decision must be made based on your research
question(s) and what you deem biologically interesting.
-r 2 Do not report SSRs where the base unit repeats less than 2
times. This decision must be made based on your research
question(s) and what you deem biologically interesting.
-t 4 Use 4 threads of execution instead of the default 1. This must
be based on the machine you run SA-SSR on. Obviously, it
doesn't make sense to run with 9 threads on an 8 thread
machine.
-f Output the full sequence from the fasta file from which an SSR
in a particular output record was found. This may be desirable
if your downstream analysis requires knowing the full sequence
from which the SSR was found. This option is intended to ease
downstream analysis by removing the requirement to parse the
original fasta file and the output file simultaneously. If you
do not need this, do not use this option as it will radically
increase the size of the output file.
If you inspect these two output files, you'll observe 12 output records
in `output_default.tsv' and 7 output records in `output.tsv'. Using the
custom parameters removed 6 output records; each were theoretically
erroneous results (the SSRs really were there in the fasta sequence,
but the fasta sequence wasn't valid). 1 additional output record was
added by using the custom parameters because the fasta sequence was
below the minimum length for processing under the default parameters.
IV. Funding and Acknowledgements
-------------------------------
Funding for the research and production of this software was provided by
startup funds to Dr. Perry Ridge.
V. Contact
-----------
For questions, comments, concerns, feature requests, suggestions, etc., please
contact:
Pery Ridge, Ph.D. -- perry.ridge@byu.edu
Note: For usage questions, please consult section `III. Usage Instructions and
Examples' first.