Encoding example
Create a compressed tar archive:
tar -b1 -czvf info_to_code.tar.gz ./info_to_code/
Zero-pad the archive so that its size is a multiple of 512 bytes:
truncate -s2116608 ./info_to_code.tar.gz
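As a rough illustration of the padding step, the sketch below rounds a file size up to the next multiple of 512 bytes and appends zero bytes to reach it. The helper names `padded_size` and `zero_pad` are hypothetical and not part of this repository; the `truncate` command above sets an absolute size instead.

```python
import os


def padded_size(n_bytes: int, block: int = 512) -> int:
    """Return the smallest multiple of `block` that is >= n_bytes."""
    return -(-n_bytes // block) * block  # ceiling division, then scale back up


def zero_pad(path: str, block: int = 512) -> int:
    """Append zero bytes to `path` until its size is a multiple of `block`."""
    size = os.path.getsize(path)
    target = padded_size(size, block)
    with open(path, "ab") as fh:
        fh.write(b"\0" * (target - size))
    return target
```

The size used above, 2116608 bytes, is already an exact multiple of 512 (512 x 4134), so `padded_size(2116608)` returns it unchanged.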
Or download the original archive:
wget http://files.teamerlich.org/dna_fountain/dna-fountain-input-files.tar.gz
Actual encoding of data as DNA (output is a FASTA file):
python encode.py \
--file_in info_to_code.tar.gz \
--size 32 \
-m 3 \
--gc 0.05 \
--rs 2 \
--delta 0.001 \
--c_dist 0.025 \
--out info_to_code.tar.gz.dna \
--stop 72000
Add annealing sites:
grep -v '>' info_to_code.tar.gz.dna |\
awk '{print "GTTCAGAGTTCTACAGTCCGACGATC"$0"TGGAATTCTCGGGTGCCAAGG"}' \
> info_to_code.tar.gz.dna_order
The output file, info_to_code.tar.gz.dna_order, is ready for ordering synthetic DNA.
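For readers who prefer Python to the grep/awk pipeline, the annealing-site step can be sketched as follows. The two flanking sequences are copied from the awk command above; the function name and the constant names `LEFT_SITE` and `RIGHT_SITE` are hypothetical. Unlike the shell pipeline, this sketch also skips blank lines.

```python
LEFT_SITE = "GTTCAGAGTTCTACAGTCCGACGATC"   # prepended to every oligo
RIGHT_SITE = "TGGAATTCTCGGGTGCCAAGG"       # appended to every oligo


def add_annealing_sites(fasta_lines):
    """Yield each sequence line flanked by the two annealing sites."""
    for line in fasta_lines:
        seq = line.rstrip("\n")
        if ">" in seq:      # FASTA header line, same filter as grep -v '>'
            continue
        if not seq:         # assumption: blank lines carry no oligo
            continue
        yield LEFT_SITE + seq + RIGHT_SITE
```

It could be applied to an open file handle, for example `add_annealing_sites(open("info_to_code.tar.gz.dna"))`, writing each yielded string to a new line of the order file.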
Decoding example
Convert BCL to FASTQ using Picard (https://github.com/broadinstitute/picard):
for i in {1101..1119} {2101..2119}; do
mkdir ~/Downloads/fountaincode/seq_data3/$i/;
done
for i in {1101..1119} {2101..2119}; do
java -jar ~/Downloads/picard-tools-2.5.0/picard.jar \
IlluminaBasecallsToFastq \
BASECALLS_DIR=./raw/19854859/Data/Intensities/BaseCalls/ \
LANE=1 \
OUTPUT_PREFIX=./seq_data3/$i/ \
RUN_BARCODE=19854859 \
MACHINE_NAME=M00911 \
READ_STRUCTURE=151T6M151T \
FIRST_TILE=$i \
TILE_LIMIT=1 \
FLOWCELL_BARCODE=AR4JF;
done
(Sequencing data are available at www.ebi.ac.uk/ena/data/view/PRJEB19305 and www.ebi.ac.uk/ena/data/view/PRJEB19307)
Stitch the paired-end reads with PEAR (Zhang J et al., Bioinformatics, 2014). This step merges each pair of 150 nt reads to recover the full-length oligo.
for i in {1101..1119} {2101..2119}; do
pear -f ./$i.1.fastq -r ./$i.2.fastq -o $i.all.fastq;
done
Retain only fragments of exactly 152 nt (the original oligo size):
awk '(NR%4==2 && length($0)==152){print $0}' *.all.fastq.assembled.fastq > all.fastq.good
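The 152 nt figure is consistent with the parameters used elsewhere in this example, assuming each oligo carries the header (`--header_size 4`), the payload (`--size 32`), and the Reed-Solomon bytes (`--rs 2`) at 2 bits per nucleotide; that layout is an assumption here, not something the commands state. The sketch below checks the arithmetic and mirrors the awk length filter; both function names are hypothetical.

```python
def oligo_length_nt(payload_bytes=32, header_bytes=4, rs_bytes=2, bits_per_nt=2):
    """Oligo length in nucleotides, assuming 2 bits are stored per nucleotide."""
    return (payload_bytes + header_bytes + rs_bytes) * 8 // bits_per_nt


def reads_of_length(fastq_lines, length):
    """Yield the sequence line of each 4-line FASTQ record with the given length."""
    for index, line in enumerate(fastq_lines):
        seq = line.rstrip("\n")
        # The sequence is the 2nd line of each record (NR%4==2 in awk, 1-based).
        if index % 4 == 1 and len(seq) == length:
            yield seq
```

With the defaults, `oligo_length_nt()` gives (32 + 4 + 2) x 8 / 2 = 152.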
Sort to prioritize highly abundant reads (gsed is GNU sed as commonly installed on macOS; on Linux, plain sed is equivalent):
sort -S4G all.fastq.good | uniq -c > all.fastq.good.sorted
gsed -r 's/^\s+//' all.fastq.good.sorted |\
sort -r -n -k1 -S4G > all.fastq.good.sorted.quantity
Drop column 1 (the number of times each read was seen) and exclude reads containing N:
cut -f2 -d' ' all.fastq.good.sorted.quantity |\
grep -v 'N' > all.fastq.good.sorted.seq
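The count, sort, and filter steps above can be approximated in a few lines of Python. This is only a sketch with a hypothetical function name; reads with equal counts may come out in a different order than from the shell pipeline.

```python
from collections import Counter


def rank_reads(reads):
    """Return unique reads, most abundant first, dropping reads that contain N."""
    counts = Counter(reads)                      # like sort | uniq -c
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [seq for seq, _count in ranked if "N" not in seq]
```

Feeding the decoder the most abundant reads first follows the stated intent of the sorting step, namely to prioritize highly abundant reads.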
Decoding:
python ~/Downloads/fountaincode/receiver.py \
-f ./seq_data3/all.fastq.good.sorted.seq \
--header_size 4 \
--rs 2 \
--delta 0.001 \
--c_dist 0.025 \
-n 67088 \
-m 3 \
--gc 0.05 \
--max_hamming 0 \
--out decoder.out.bin
Checksum verification (md5 is the macOS command; on Linux, use md5sum):
md5 decoder.out.bin
expected output is 8651e90d3a013178b816b63fdbb94b9b
md5 info_to_code.tar.gz
expected output is 8651e90d3a013178b816b63fdbb94b9b
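Where neither `md5` nor `md5sum` is available, the same check can be done with Python's standard `hashlib` module. A minimal sketch follows; the function name `md5_hex` is hypothetical.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Decoding succeeded if `md5_hex("decoder.out.bin")` equals `md5_hex("info_to_code.tar.gz")`.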