-
Notifications
You must be signed in to change notification settings - Fork 20
Tutorial Curation Ecoli LTEE
Back to the Main Curation Tutorial Page
This tutorial illustrates some advanced curation topics through analyzing genomes from the E. coli Long-Term Evolution Experiment (LTEE).
The long duration of this experiment and the large number of genomes that have been sequenced from it lead to some rare (and more challenging) cases occasionally cropping up dur to how many mutations have accumulated. Because of this, we have created some extra utilites to help us curate these genomes.
To get started with the workflow, you need to clone the LTEE-Ecoli GitHub repository.
git clone https://github.com/barricklab/LTEE-Ecoli.git
Now create the ltee-ecoli
Conda environment from the included environment.yml
file.
cd LTEE-Ecoli
mamba create -f environment.yml
Note
The LTEE-Ecoli repository uses a specific version of breseq (often not the newest one) to ensure compatibility between GenomeDiff files and gdtools utility commands.
In this tutorial, we will run the LTEE-Ecoli script from where they reside in the
LTEE-Ecoli/bin
directory by including their relative paths on the command-line.
You could add this location to your $PATH
if you want to run them by name.
Create a "curation directory" inside of the main repository directory for all of your in-progress work. (It's also fine to create this anywhere else. Just, in that case, be sure you are giving the full path to the script when running them.)
You MUST name your "curation directory" in the format Ara+*
or Ara-*
(for example, Ara-3
). This lets the pipeline determine the correct ancestor genome to use.
mkdir Ara-3
cd Ara-3
Now, create a folder called 01_breseq_initial_gd
within the "curation directory" (Ara-3
in the example):
mkdir 01_breseq_initial_gd
Guess what you put here? Correct! The GenomeDiff files directly output by your breseq
runs of new sample. Let's say those are located under my/brefito/run/breseq-references/gd
.
Then you could use this command to copy them here:
cp my/brefito/run/breseq-references/gd/*.gd 01_breseq_initial_gd
You should also copy over some already curated GenomdDiff files from the same LTEE population.
You can find these in the main repository folder LTEE-clone-curated
.
You could copy over all of the ones from population A-5 this way:
cp ../LTEE-clone-curated/A-5*.gd 01_breseq_initial_gd
Note
You might want to only copy over a subset of them that is spread over generations at first, because the pipeline will take longer the more you use, but for final curation you should use them all.
Run this LTEE-Ecoli command from your curation directory with the ltee-ecoli
Conda environment activated.
../bin/population.sh init
This generates new directories (00_header
, 02_curate_add
, 02_curate_remove
) with empty copies
of your GenomeDiff files in them. We are going to edit these to do our annotation!
Now edit the newly created 00_header
files for your genomes so they have additional metadata,
such as TREATMENT
, TIME
, POPULATION
, CLONE
, and MUTATOR_STATUS
.
Here's an example you can work from:
#=GENOME_DIFF 1.0
#=TITLE Ara-5_10000gen_4540A
#=AUTHOR <yourname>
#=TIME 10000
#=POPULATION Ara-5
#=TREATMENT LTEE
#=CLONE A
#=MUTATOR_STATUS non-mutator
#=REFSEQ https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/001/SRR2589061/SRR2589061_1.fastq.gz
#=READSEQ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/001/SRR2589061/SRR2589061_2.fastq.gz
We are going to use a script that can help you validate your curation with some specific additions/settings for working with LTEE populations.
First, let's run the population curation helper script before you get started doing any manual curation:
../bin/population.sh
This will print out a ton of run information to the command line and create many output files.
The ones that are most immediately useful are these:
-
04_final_normalized_gd
- contains the GenomeDiff files that are fully curated. Use these withgdtools APPLY
to test your curation. Once they look complete, use them for your analyses. -
compare_normalized.html
- the HTML output ofgdtools COMPARE
on the fully curated GenomeDiffs. -
07_phylogeny/tree.rerooted.tre
- Newick format phylogenetic tree that can be viewed in programs like https://github.com/rambaut/figtree/releases.
Most of the rest of the files are related to different ways of counting or not counting mutations.
Note
What's normalization?
Some mutations, such as a deletion of an A
in a run of AAAAA
could be annotated with
multiple positions. gdtools NORMALIZE
is used to make all of these cases match, which
is important for deciding whether the same mutation is present in multiple genomes.
Note
What's masking? Some of the other files and directories mention "masking". What this means is that mutations in certain regions of the genome are not counted. Why? If you have a mixture of some datasets with longer reads and some with short reads, the short-read datasets will miss some mutations in repetitive regions that the others can fine. This leads to unequal counting of mutations and disrupts a phylogenetic tree. For the LTEE, we use a masking file that asks what regions of the genome one would be able to call mutations in with 36-bp reads (the shortest in any dataset).
Based on looking at unassigned evidence in your breseq results, you will add mutations
to the 02_curate_add
GenomeDiff file, as described in the rest of this tutorial. Sometimes you will divide up a mutation predicted by breseq into several mutations that happen one after another such that, in the end, they create the same final change to the genome. Other times breseq might have a false-positive prediction of a mutation. These are the cases when you will need to copy a mutation line from your original GenomeDiff file to the one with the same name in 02_curate_remove
.
The compare_normalized.html
file is great for finding when later genomes have mutations that
hide earlier mutations or that are described in a different way. If you entered TIME
and CLONE
metadate in your headers, the columns will be sorted by this information.
The directory 07_phylogeny/discrepancies
contains Newick tree files
for all mutations that don't agree with the overall phylogenetic tree.
Both files are helpful for noticing earlier mutations that were deleted in a later genome
and need to be marked with deleted=1
, mutations that
were otherwise modified and need to be broken down into their constituent parts, and sometimes
mutations that were just missed by breseq and can be found upon further examination of the
sequencing data (for example with breseq BAM2COV
, breseq BAM2ALN
, or IGV).
Inescapably, you'll introduce some typos and other inconsistencies into your 02_curate_add
and 02_curate_remvoe
GenomeDiff files as you edit them that cause them to fail validation. You can quickly check for formatting problems in all of the file you are editing using this command:
gdtools VALIDATE 02*/*.gd
Copy the files from 04_final_normalized_gd
back to where you ran brefito into the
genome-diffs
folder in your main directory there. Now you can use brefito to automate
checking your annotations using gdtools APPLY
and re-running breseq against the mutant genomes.
You might get tired of copying files from your curation
directory to where you are running brefito every cycle of curation. One solution is to create symbolic links from the mutant sequences generated by gdtools APPLY
in the curation
directory to where you run your brefito workflows. Let's call that location /complete/path/to/brefito
in the example commands for doing this below.
mkdir /complete/path/to/brefito/mutants
cd mutated_genomes
../../bin/batch_run.pl -0 "ln -s $PWD/#d /complete/path/to/brefito/mutants/#d"
Caution
If you do this, be sure that you use ABSOLUTE versus RELATIVE paths for the symlinks. Snakemake will not recognize relative links as input files. It will be like they don't exist.
Quick Start
Installation
Test Drive
More Options
Usage: breseq
Usage: gdtools
More Information
GenomeDiff File Format
Reference Sequence File Formats
Output
Methods
Bibliography
FAQ
More Examples
Tutorial: Clones
Tutorial: Populations
Tutorial: Barcoded/Targeted
Tutorial: Curation
Contribute
Developer