This tool provides an easy-to-use analysis of synthetic oligo libraries.
- SOLQC Overview
- Setup
- Preparation (What you need)
- Usage
- Output
- Configuration File
- Example using the toy data
SOLQC is a tool for statistical analysis of synthetic oligo libraries.
Given a list of designed sequences and a list of sequenced reads generated from those designed sequences, SOLQC outputs a statistical analysis report of the synthesized sequences.
The tool's pipeline is as follows:
- Preprocessing : Iterate over the sequenced reads of the library and filter out reads that do not match certain parameters (prefix, length, etc.).
- Matching : Matching between the reads and the variants.
- Alignment : Aligning each read to its matched variant.
- Analysis : Analyzing the alignment and matching results.
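The first two steps of the pipeline can be illustrated with a minimal sketch. This is not SOLQC's actual code: the function names `preprocess` and `match`, the toy sequences, and the assumption that a barcode is read as a Python slice (end position exclusive) are all hypothetical and chosen only to show the idea.

```python
from collections import Counter

def preprocess(reads, prefix):
    """Keep only reads that start with the expected prefix."""
    return [r for r in reads if r.startswith(prefix)]

def match(reads, design, barcode_start, barcode_end):
    """Count reads per barcode by looking up each read's barcode slice."""
    counts = Counter()
    for read in reads:
        barcode = read[barcode_start:barcode_end]  # assumption: end is exclusive
        if barcode in design:
            counts[barcode] += 1
    return counts

# Hypothetical toy data: a 4-base prefix followed by a 3-base barcode.
design = {"AAA": "ACGTAAATTT", "CCC": "ACGTCCCTTT"}
reads = ["ACGTAAATTT", "ACGTAAATTA", "ACGTCCCTTT", "TTTTAAATTT"]

kept = preprocess(reads, prefix="ACGT")            # drops the last read
counts = match(kept, design, barcode_start=4, barcode_end=7)
print(dict(counts))
```

Alignment and analysis would then operate on the matched reads; they are left out here to keep the sketch short.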
In its current state, the tool assumes the user has some familiarity with Python.
You'll need to run the tool with python 3.6.5.
Start by cloning the repository to a local directory.
Next, open a command line tool, go to the root folder and run:
pip install -r requirements.txt
This will install all the necessary modules to run the tool.
In order to use the tool you'll need the following:
- Design, which can be one of 2 options:
- A design file, in CSV format, containing 2 columns : [barcode, variant]
- barcode - a sequence identifier for the variant. [Needed for matching between a read and a variant].
- variant - the complete variant sequence. [Needed for the alignment, to analyse mismatches and indels].
- IUPAC string
- A reads text file containing the names of all the FASTA/FASTQ files of the sequenced reads (one row for each file).
- A config.json file containing the different possible configuration parameters; see Configuration File.
Here is an example for each of those files:
data/my_data/reads_1.fastq
data/my_data/reads_2.fastq
{
"prefix" : "ACAACGCTTTCTGTGTCGTG",
"suffix" : "",
"length" : 0,
"barcode_start" : 20,
"barcode_end" : 32
}
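As a rough illustration of what these three inputs look like once loaded, here is a sketch using only the Python standard library. It is not taken from SOLQC: the helper names `load_design` and `load_read_paths` are hypothetical, the design sequences are made up, and it assumes the design CSV has a header row naming the `barcode` and `variant` columns.

```python
import csv
import io
import json

def load_design(handle):
    """Return {barcode: variant} from a CSV with 'barcode' and 'variant' columns."""
    return {row["barcode"]: row["variant"] for row in csv.DictReader(handle)}

def load_read_paths(handle):
    """Return the non-empty lines of a reads list file, one path per line."""
    return [line.strip() for line in handle if line.strip()]

# In-memory stand-ins for design.csv, reads.txt and config.json (toy content).
design_csv = io.StringIO("barcode,variant\nAAA,ACGTAAATTT\nCCC,ACGTCCCTTT\n")
reads_txt = io.StringIO("data/my_data/reads_1.fastq\ndata/my_data/reads_2.fastq\n")
config_json = """{
  "prefix": "ACAACGCTTTCTGTGTCGTG",
  "suffix": "",
  "length": 0,
  "barcode_start": 20,
  "barcode_end": 32
}"""

design = load_design(design_csv)
read_paths = load_read_paths(reads_txt)
config = json.loads(config_json)  # strict JSON: a trailing comma would raise an error
```

Note that Python's `json` module rejects a trailing comma after the last key, so the config file should not end its last entry with one.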
Open a command line, go to the root folder and run:
python main.py -d <path_to_design>/design.csv -r <path_to_read>/reads.txt -c <path_to_config>/config.json
Or, if you are using an IUPAC string instead of a design file:
python main.py -d "IUPAC_string" -r <path_to_read>/reads.txt -c <path_to_config>/config.json
- --no-edit (flag) : Skip the alignment between the reads and the variants (highly recommended if you don't need any alignment-related analysis, as it saves a lot of running time).
- --edit (flag) : Perform the alignment [Default].
- -a (str array) : Allows the specification of different matching strategies. Currently only one matching strategy is implemented.
- -id (str) : Prefix for the output files (relevant if you want to run multiple runs without erasing old output).
- More parameters will come soon!
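To make the shape of these options concrete, the following `argparse` sketch mirrors the flags listed above. It is only an illustration and not the parser used in main.py: the destination names (`design`, `reads`, `config`, `matchers`, `run_id`) and the help texts are assumptions.

```python
import argparse

def build_parser():
    """Build an illustrative parser with the options described above."""
    parser = argparse.ArgumentParser(description="Illustrative SOLQC-style CLI")
    parser.add_argument("-d", dest="design", required=True,
                        help="design CSV path or IUPAC string")
    parser.add_argument("-r", dest="reads", required=True,
                        help="text file listing the read files")
    parser.add_argument("-c", dest="config", required=True,
                        help="config JSON path")
    # --edit and --no-edit share one destination; alignment is on by default.
    parser.add_argument("--edit", dest="edit", action="store_true", default=True)
    parser.add_argument("--no-edit", dest="edit", action="store_false")
    parser.add_argument("-a", dest="matchers", nargs="+", default=[],
                        help="matching strategies")
    parser.add_argument("-id", dest="run_id", default="",
                        help="prefix for output files")
    return parser

# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(
    ["-d", "design.csv", "-r", "reads.txt", "-c", "config.json",
     "--no-edit", "-id", "run1"]
)
print(args.edit, args.run_id)
```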
We will soon allow selecting the analyses to run on the library from the command line, but
currently you'll need to go to main.py and choose them yourself.
Go to line 139 and choose the desired analyses (you can see all of them in the analyzer.py file).
analyzers = AnalyzerFactory.create_analyzers([AnalyzersNames.MATCHING_ANALYZER,
AnalyzersNames.FREQUENCY_ANALYZER
])
Once the tool is done you can find the analysis results under a deliverable folder.
In order to run the tool you must supply a config file to the program. This should be a json file containing the following parameters:
{
"prefix" : "ACAACGCTTTCTGTGTCGTG",
"suffix" : "",
"length" : 0,
"barcode_start" : 20,
"barcode_end" : 32
}
- prefix : If supplied will remove all reads not starting with the supplied sequence.
- suffix : If supplied will remove all reads not ending with the supplied sequence.
- length : If supplied and set to a value above 0, will only keep reads with length - 5 <= len(read) <= length + 5.
- barcode_start : Start position of the barcode.
- barcode_end : End position of the barcode.
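The filtering rules described by these parameters can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: `passes_filters` is a hypothetical helper, the example read is made up, and the barcode is assumed to be taken as a Python slice with `barcode_end` exclusive.

```python
def passes_filters(read, prefix="", suffix="", length=0, tolerance=5):
    """Return True if the read satisfies the prefix, suffix and length rules."""
    if prefix and not read.startswith(prefix):
        return False
    if suffix and not read.endswith(suffix):
        return False
    # A length of 0 disables the length filter.
    if length > 0 and not (length - tolerance <= len(read) <= length + tolerance):
        return False
    return True

config = {
    "prefix": "ACAACGCTTTCTGTGTCGTG",
    "suffix": "",
    "length": 0,
    "barcode_start": 20,
    "barcode_end": 32,
}

# Made-up read: the 20-base prefix, a 12-base barcode, then some payload.
read = config["prefix"] + "AAAACCCCGGGG" + "TTTT"

keep = passes_filters(read, config["prefix"], config["suffix"], config["length"])
barcode = read[config["barcode_start"]:config["barcode_end"]]  # assumption: end exclusive
print(keep, barcode)
```

With this config the 20-base prefix occupies positions 0 to 19, so a barcode starting at position 20 begins immediately after it.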
We recommend running the tool with the toy data supplied with this repository.
This will give you a sense of how to use the tool on a relatively small data set, so the entire analysis will run
in less than 30 seconds.
After you set up the tool, simply run:
python main.py -d data/toy_data/design.csv -r data/toy_data/reads.txt -c data/toy_data/config.json