Skip to content

1. Barcode Identification

Peter edited this page Nov 28, 2019 · 11 revisions

Overview

Trimming

The first step of the SPRITE pipeline is to remove Illumina sequencing adaptors from paired end reads using Trim Galore!

The next step of the SPRITE pipeline is to identify the barcodes of the sequenced reads. This process is run by a standalone Java program that accepts two input FASTQ files (paired-end sequencing), and outputs modified versions of these FASTQ files for subsequent alignment. See below for more details.

Usage

java -jar BarcodeIdentification.jar \
--input1 read1.fastq.gz \
--input2 read2.fastq.gz \
--output1 read1.barcoded.fastq.gz \
--output2 read2.barcoded.fastq.gz \
--config configuration_file.txt

Input and output FASTQ files

The inputs to this program are the two FASTQ files from a paired-end sequencing run. These files can be gzipped or uncompressed. The outputs are the same FASTQ files with the identified tags appended to the read name. The output files will be gzipped.

Example input:

@HISEQ:623:HY5KHBCXX:2:2206:7231:7108
ATTGNTAGGTCGGAATTGCACGCTGTAGCGGCATGCTGATGGAGAGGAGAGACTTCTAGCTAGCTACGTGACTGATCCGCACACTGCGACACGTGATCGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Example output:

@HISEQ:623:HY5KHBCXX:2:2206:7231:7108::[DPM6A5][NYBot35_Stg][NOT_FOUND][Even2Bo19][Odd2Bo69]
ATTGNTAGGTCGGAATTGCACGCTGTAGCGGCATGCTGATGGAGAGGAGAGACTTCTAGCTAGCTACGTGACTGATCCGCACACTGCGACACGTGATCGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

In the above example, four of five tags were successfully identified:

  • a DPM sequence named "DPM6A5"
  • a Y-shape sequence named "NYBot35_Stg"
  • an even sequence named "Even2Bo19"
  • an odd sequence named "Odd2Bo69"

No sequence in position three was found, so a NOT_FOUND tag was inserted. This is the standard behavior when a position cannot be identified.

Configuration file

The configuration file specifies the base sequence of every possible tag, as well as the ordering of the tags within a valid barcode. This is best explained by example. One of the first tag ligation schemes or "layouts" used in our lab is the following:

tag layout

Not shown are sticky-end sequences. The first thirty lines of configuration file used to analyze this layout is below. The complete file can be found here.

######################################
# EXAMPLE CONFIG FILE FOR BARCODE ID #
######################################

READ1 = DPM
READ2 = Y|SPACER|ODD|SPACER|EVEN|SPACER|ODD

############
# OPTIONAL #
############

# Length of spacer. Defaults to 6
# SPACER = 6

# The maximum number of bases to look ahead if no tag is initially found.
# Defaults to 6.
# LAXITY = 6

########
# TAGS #
########

# Columns:
# CATEGORY NAME SEQUENCE NUM_MISMATCHES

# Valid categories are "EVEN", "ODD", "Y", "RPM", "DPM", "LIGTAG"

EVEN	Even2Bo1	ATACTGCGGCTGACG	2
EVEN	Even2Bo5	CTAGGTGGCGGTCTG	2
EVEN	Even2Bo2	GTGACATTAAGGTTG	2

The first few lines specify the layout in both read 1 and read 2. Tag categories are separated by pipes. The valid categories (DPM, EVEN, ODD, etc.) have no implicit meaning, and are just used to divide tags into categories. The SPACER category is a special category to represent sticky-ends. The identification behavior at a sticky-end will be explained later, along with the SPACER and LAXITY variables found in the optional section.

The tag section consists of many tab-delimited lines. Each line has four fields and represents a unique tag sequence. The fields, in order, are

  • the tag category (DPM, EVEN, ODD, etc)
  • the tag name
  • the tag sequence
  • the tag error tolerance

The tag error tolerance field allows for some tags to have base mismatches or miscalls. A value of zero means to only accept the specified sequence as a match; a value of one means to accept any sequence within one Hamming distance of the specified sequence as a match; and so on for two, three, etc.

Step-by-step

With the example configuration file, the program will do the following:

  1. The program starts at the 5'-end of read 1 and searches for a valid DPM sequence. Specifically, all DPM sequences are eight bases long, so the program does a hashmap lookup of the first eight bases in read 1. No mismatches are tolerated in the DPM sequence because the error tolerance in all of the DPM configuration-file lines is zero. If a corresponding DPM tag is found, its name is appended to the FASTQ record's name in square brackets. Otherwise, "[NOT_FOUND]" is appended. In either case, the program continues.

  2. There are no more tags in read 1, so the program continues to read 2.

  3. The program starts at the 5'-end of read 2 and searches for a valid Y-shape sequence. The Y-shape tag varies from nine to twelve bases, so the program does successive hashmap lookups of the first nine, ten, eleven and twelve bases of read 2. Like the DPM sequences, the Y-shape sequences cannot have any mismatches. The name of the found sequence is appended to the FASTQ record name, and the program continues.

  4. Following the Y-shape tag is a sticky-end, or SPACER element. The SPACER variable defaults to 6, so the program skips the next six bases after the end of the Y tag.

  5. After skipping the six bases, the program looks for an ODD tag. If one is found at that position, its name is appended to the FASTQ name. If one isn't found, one base is skipped (in this case, the seventh base after the end of the Y tag) and the search repeats. Again, if one isn't found, the search position advances an additional base. This process repeats a number of times equal to the LAXITY variable. With SPACER = 6 and LAXITY = 6, the program will search at positions 7, 8, 9, 10, 11, 12 and 13 bases past the end of the Y tag.

Filtering barcodes

Reads that do not contain a full complement of barcodes are filtered out at this stage using the get_full_barcodes.py script:

python get_full_barcodes.py --r1 example.R1.barcoded.fastq.gz

The output of this script is two fastq files, one with reads containing the full complement of barcodes example.R1.barcoded_full.fastq.gz and the other containing everything else example.R1.barcoded_short.fastq.gz.

DPM trimming

In most cases, the DPM identity is derived from read 1, while the remaining barcodes come from read 2. As a result, the residual DPM sequences that would hinder alignment can only be removed after barcode identification with a second round of trimming using Cutadapt. Cutadapt searches for the common 5’ DPM sequence (GATCGGAAGAG), as well as the 96 barcoded 3’ DPM sequence (dpm96.fasta), accounting for cases where read 1 extends through the genomic DNA sequence into the SPRITE tags ligated on the 3’end of the fragment.