Tutorial Curation Ecoli LTEE

Jeffrey Barrick edited this page Jul 20, 2024 · 6 revisions

This tutorial illustrates some additional advanced curation topics through analyzing genomes of E. coli from the long-term evolution experiment (LTEE).

The long duration of this study and the large number of genomes that have been sequenced from it lead to some rare (and more challenging) cases occasionally cropping up. Because of this, we have created some scripts that help us curate these genomes.

Clone and set up the LTEE-Ecoli repository

To get started with the workflow, you need to clone the LTEE-Ecoli GitHub repository.

git clone https://github.com/barricklab/LTEE-Ecoli.git

Now create the ltee-ecoli Conda environment from the included environment.yml file.

cd LTEE-Ecoli
mamba create -f environment.yml

Note

The LTEE-Ecoli repository uses specific versions of breseq to ensure compatibility with the GenomeDiff format and gdtools utility commands.

In this tutorial, we will run the LTEE-Ecoli scripts from where they reside in the LTEE-Ecoli/bin directory by including their relative paths on the command line. You could add this location to your $PATH if you want to run them by name.
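If you prefer to run the scripts by name, a minimal sketch of the $PATH change looks like this. It assumes your current directory is the top of the LTEE-Ecoli clone, and it only affects the current shell session.

```shell
# Assumption: the current directory is the top of the LTEE-Ecoli clone.
# Prepend its bin directory so the scripts can be found by name.
export PATH="$PWD/bin:$PATH"
```

The rest of this tutorial keeps using relative paths such as ../bin/population.sh, so this step is optional.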

Copy input GenomeDiff files

Create a curation directory inside of the main repository directory for all of your in-progress work. (It's also fine to create this anywhere else. Just be sure, in that case, to give the full path to the scripts when running them.)

mkdir curation
cd curation

Now, create a folder called 01_breseq_initial_gd within the curation directory:

mkdir 01_breseq_initial_gd

Guess what you put here? Correct! The GenomeDiff files directly output by your breseq runs of new samples. Let's say those are under my/brefito/run/breseq-references/gd. Then you could use this command:

cp my/brefito/run/breseq-references/gd/*.gd 01_breseq_initial_gd

You should also copy over some already curated GenomeDiff files from the same LTEE population. You can find these in the main repository folder LTEE-clone-curated.

You could copy over all of the ones from population A-5 this way:

cp ../LTEE-clone-curated/A-5*.gd 01_breseq_initial_gd

Note

At first, you might want to copy over only a subset of them that is spread across generations, because the pipeline takes longer the more genomes you include.
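One way to pick such a subset is to copy every Nth file from the list. The sketch below is only an illustration: copy_every_nth is a hypothetical helper (not part of the LTEE-Ecoli scripts), and it relies on the shell's alphabetical glob order, which may not match generation order for every file name.

```shell
# Hypothetical helper, not part of LTEE-Ecoli:
# copy every Nth file from the argument list into a destination directory.
# Usage: copy_every_nth N DEST_DIR FILE...
copy_every_nth() {
  n="$1"
  dest="$2"
  shift 2
  i=0
  for f in "$@"; do
    # Keep the 1st, (N+1)th, (2N+1)th, ... file in the list.
    if [ $((i % n)) -eq 0 ]; then
      cp "$f" "$dest"/
    fi
    i=$((i + 1))
  done
}
```

For example, copy_every_nth 3 01_breseq_initial_gd ../LTEE-clone-curated/A-5*.gd would copy roughly a third of the A-5 files. Check the result by eye to confirm the time points you care about are included.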

Initialize the curation directory

Run this LTEE-Ecoli command from your curation directory with the ltee-ecoli Conda environment activated.

../bin/population.sh init

This generates new directories (00_header, 02_curate_add, 02_curate_remove) with empty copies of your GenomeDiff files in them. We are going to edit these to do our annotation!

Edit the header GenomeDiff files

Now edit the newly created 00_header files for your genomes so they have additional metadata, such as TREATMENT, TIME, POPULATION, CLONE, and MUTATOR_STATUS.

#=GENOME_DIFF	1.0
#=TITLE	Ara-5_10000gen_4540A
#=AUTHOR	<yourname>
#=TIME	10000
#=POPULATION	Ara-5
#=TREATMENT	LTEE
#=CLONE	A
#=MUTATOR_STATUS	non-mutator
#=REFSEQ	https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/001/SRR2589061/SRR2589061_1.fastq.gz
#=READSEQ	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/001/SRR2589061/SRR2589061_2.fastq.gz
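Before running the pipeline, it can help to confirm that every header file has the metadata fields listed above. The following is a minimal sketch: check_header is a hypothetical helper (not part of the LTEE-Ecoli scripts), and the list of keys is taken from the example header rather than from any requirement enforced by the pipeline.

```shell
# Hypothetical helper, not part of LTEE-Ecoli:
# report which of the example metadata keys are missing from a header file.
# Returns 0 if all keys are present and 1 otherwise.
check_header() {
  file="$1"
  missing=0
  for key in TIME POPULATION TREATMENT CLONE MUTATOR_STATUS; do
    # Header lines look like "#=KEY<TAB>value".
    if ! grep -q "^#=${key}[[:space:]]" "$file"; then
      echo "$file: missing #=$key"
      missing=1
    fi
  done
  return "$missing"
}
```

From the curation directory, you could run it over all headers with: for f in 00_header/*.gd; do check_header "$f"; done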

Test your curation

Now run the main population pipeline.

../bin/population.sh

This will print out a ton of run information to the command line and create many output files.

The ones that are most immediately useful are these:

  • 04_final_normalized_gd - contains the GenomeDiff files that are fully curated. Use these with gdtools APPLY to test your curation. Once they look complete, use them for your analyses.
  • compare_normalized.html - the HTML output of gdtools COMPARE on the fully curated GenomeDiffs.
  • 07_phylogeny/tree.rerooted.tre - Newick format phylogenetic tree that can be loaded in programs like FigTree (https://github.com/rambaut/figtree/releases).

Most of the rest of the files are related to different ways of counting or not counting mutations.

Note

What's normalization? Some mutations, such as a deletion of one A in a run of AAAAA, could be annotated at any of several positions. gdtools NORMALIZE is used to make all of these cases match, which is important for deciding whether the same mutation is present in multiple genomes.
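To see why these annotations are equivalent, here is a small illustration using a made-up sequence. It does not show how gdtools NORMALIZE works internally; delete_base is a hypothetical helper that just removes one base from a string.

```shell
# Hypothetical helper for illustration only:
# delete the base at a 1-based position from a sequence string.
delete_base() {
  awk -v s="$1" -v p="$2" 'BEGIN { print substr(s, 1, p - 1) substr(s, p + 1) }'
}

# Made-up sequence: positions 3 through 7 are the run of five A's.
for pos in 3 4 5 6 7; do
  echo "delete position $pos -> $(delete_base GCAAAAATG "$pos")"
done
```

Every iteration prints the same resulting sequence, GCAAAATG, so five different position annotations describe one and the same change to the genome.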

Note

What's masking? Some of the other files and directories mention "masking". What this means is that mutations in certain regions of the genome are not counted. Why? If you have a mixture of some datasets with longer reads and some with short reads, the short-read datasets will miss some mutations in repetitive regions that the others can find. This leads to unequal counting of mutations and disrupts a phylogenetic tree. For the LTEE, we use a masking file that asks what regions of the genome one would be able to call mutations in with 36-bp reads (the shortest in any dataset).

Sometimes it can be hard to tell if

Edit the curate add and remove GenomeDiff files

Based on looking at unassigned evidence in your breseq results, you will add mutations to the GenomeDiff file, as described in the rest of this tutorial.

Sometimes you will divide up a mutation predicted by breseq into several mutations that happen one after another such that, in the end, they create the same final change to the genome.

Other times breseq might have a false-positive prediction of a mutation.

These are the cases when you will need to copy a mutation line from your original GenomeDiff file to the one with the same name in 02_curate_remove.
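You can copy such a line by hand in a text editor. As an alternative, the sketch below appends matching lines from one GenomeDiff file to another. copy_mutation_line is a hypothetical helper (not part of the LTEE-Ecoli scripts). It assumes the standard tab-separated GenomeDiff layout, in which the first field is the entry type and the fifth field is the position.

```shell
# Hypothetical helper, not part of LTEE-Ecoli:
# append lines of a given type and position from SRC to DEST.
# Usage: copy_mutation_line SRC.gd DEST.gd TYPE POSITION
copy_mutation_line() {
  src="$1"
  dest="$2"
  type="$3"
  pos="$4"
  # Assumed layout: field 1 = type (SNP, DEL, ...), field 5 = position.
  awk -F '\t' -v t="$type" -v p="$pos" '$1 == t && $5 == p' "$src" >> "$dest"
}
```

For a hypothetical sample file named sample.gd, the call would look like copy_mutation_line 01_breseq_initial_gd/sample.gd 02_curate_remove/sample.gd DEL 547700, where the type and position are placeholders for the false-positive prediction you found. Open the destination file afterwards to confirm that exactly the intended line was added.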

The compare_normalized.html file is great for finding when later genomes have mutations that hide earlier mutations or that are described in a different way. If you entered TIME and CLONE metadata in your headers, the columns will be sorted by this information.

The directory 07_phylogeny/discrepancies contains Newick tree files for all mutations that don't agree with the overall phylogenetic tree.

Both files are helpful for noticing earlier mutations that were deleted in a later genome and need to be marked with deleted=1, mutations that were otherwise modified and need to be broken down into their constituent parts, and sometimes mutations that were just missed by breseq and can be found upon further examination of the sequencing data (for example with breseq BAM2COV, breseq BAM2ALN, or IGV).

Test your curation!

Copy the files from 04_final_normalized_gd back into the genome-diffs folder of the main directory where you ran brefito. Now you can use brefito to automate checking your annotations by using gdtools APPLY and re-running breseq against the mutant genomes.

You might get tired of copying over the annotated GenomeDiff files after every cycle of curation. Instead, you could create symbolic links (ln -s) in the brefito genome-diffs folder that point to the files in 04_final_normalized_gd, so that both places stay in sync automatically.
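A minimal sketch of the linking step follows. link_curated_gd is a hypothetical helper (not part of brefito or LTEE-Ecoli), and the location of your brefito genome-diffs folder is an assumption you will need to adjust.

```shell
# Hypothetical helper, not part of brefito or LTEE-Ecoli:
# create symbolic links in DEST_DIR for every .gd file in SRC_DIR.
# Usage: link_curated_gd SRC_DIR DEST_DIR
link_curated_gd() {
  src_dir="$1"
  dest_dir="$2"
  # Use an absolute source path so the links resolve from any directory.
  src_abs=$(cd "$src_dir" && pwd)
  for f in "$src_abs"/*.gd; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    # -f replaces any existing file or link with the same name.
    ln -sf "$f" "$dest_dir/$(basename "$f")"
  done
}
```

From the curation directory, a call might look like link_curated_gd 04_final_normalized_gd /path/to/brefito/genome-diffs, with the second argument replaced by your actual folder. Note that -f overwrites same-named files in the destination, so back up any hand-edited copies there first.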