Skip to content

Tutorial Curation Ecoli LTEE

Jeffrey Barrick edited this page Aug 14, 2024 · 6 revisions

Back to the Main Curation Tutorial Page

This tutorial illustrates some advanced curation topics through analyzing genomes from the E. coli Long-Term Evolution Experiment (LTEE).

The long duration of this experiment and the large number of genomes that have been sequenced from it lead to some rare (and more challenging) cases occasionally cropping up dur to how many mutations have accumulated. Because of this, we have created some extra utilites to help us curate these genomes.

Clone and set up the LTEE-Ecoli repository

To get started with the workflow, you need to clone the LTEE-Ecoli GitHub repository.

git clone https://github.com/barricklab/LTEE-Ecoli.git

Now create the ltee-ecoli Conda environment from the included environment.yml file.

cd LTEE-Ecoli
mamba create -f environment.yml

Note

The LTEE-Ecoli repository uses a specific version of breseq (often not the newest one) to ensure compatibility between GenomeDiff files and gdtools utility commands.

In this tutorial, we will run the LTEE-Ecoli script from where they reside in the LTEE-Ecoli/bin directory by including their relative paths on the command-line. You could add this location to your $PATH if you want to run them by name.

Copy input GenomeDiff files

Create a "curation directory" inside of the main repository directory for all of your in-progress work. (It's also fine to create this anywhere else. Just, in that case, be sure you are giving the full path to the script when running them.)

You MUST name your "curation directory" in the format Ara+* or Ara-* (for example, Ara-3). This lets the pipeline determine the correct ancestor genome to use.

mkdir Ara-3
cd Ara-3

Now, create a folder called 01_breseq_initial_gd within the "curation directory" (Ara-3 in the example):

mkdir 01_breseq_initial_gd

Guess what you put here? Correct! The GenomeDiff files directly output by your breseq runs of new sample. Let's say those are located under my/brefito/run/breseq-references/gd.

Then you could use this command to copy them here:

cp my/brefito/run/breseq-references/gd/*.gd 01_breseq_initial_gd

You should also copy over some already curated GenomdDiff files from the same LTEE population. You can find these in the main repository folder LTEE-clone-curated.

You could copy over all of the ones from population A-5 this way:

cp ../LTEE-clone-curated/A-5*.gd 01_breseq_initial_gd

Note

You might want to only copy over a subset of them that is spread over generations at first, because the pipeline will take longer the more you use, but for final curation you should use them all.

Initialize the curation directory

Run this LTEE-Ecoli command from your curation directory with the ltee-ecoli Conda environment activated.

../bin/population.sh init

This generates new directories (00_header, 02_curate_add, 02_curate_remove) with empty copies of your GenomeDiff files in them. We are going to edit these to do our annotation!

Edit the header GenomeDiff files

Now edit the newly created 00_header files for your genomes so they have additional metadata, such as TREATMENT, TIME, POPULATION, CLONE, and MUTATOR_STATUS.

Here's an example you can work from:

#=GENOME_DIFF	1.0
#=TITLE	Ara-5_10000gen_4540A
#=AUTHOR	<yourname>
#=TIME	10000
#=POPULATION	Ara-5
#=TREATMENT	LTEE
#=CLONE	A
#=MUTATOR_STATUS	non-mutator
#=REFSEQ	https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/001/SRR2589061/SRR2589061_1.fastq.gz
#=READSEQ	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/001/SRR2589061/SRR2589061_2.fastq.gz

Checking your curation of an LTEE population

We are going to use a script that can help you validate your curation with some specific additions/settings for working with LTEE populations.

First, let's run the population curation helper script before you get started doing any manual curation:

../bin/population.sh

This will print out a ton of run information to the command line and create many output files.

The ones that are most immediately useful are these:

  • 04_final_normalized_gd - contains the GenomeDiff files that are fully curated. Use these with gdtools APPLY to test your curation. Once they look complete, use them for your analyses.
  • compare_normalized.html - the HTML output of gdtools COMPARE on the fully curated GenomeDiffs.
  • 07_phylogeny/tree.rerooted.tre - Newick format phylogenetic tree that can be viewed in programs like https://github.com/rambaut/figtree/releases.

Most of the rest of the files are related to different ways of counting or not counting mutations.

Note

What's normalization? Some mutations, such as a deletion of an A in a run of AAAAA could be annotated with multiple positions. gdtools NORMALIZE is used to make all of these cases match, which is important for deciding whether the same mutation is present in multiple genomes.

Note

What's masking? Some of the other files and directories mention "masking". What this means is that mutations in certain regions of the genome are not counted. Why? If you have a mixture of some datasets with longer reads and some with short reads, the short-read datasets will miss some mutations in repetitive regions that the others can fine. This leads to unequal counting of mutations and disrupts a phylogenetic tree. For the LTEE, we use a masking file that asks what regions of the genome one would be able to call mutations in with 36-bp reads (the shortest in any dataset).

Edit the curate add and remove GenomeDiff files

Based on looking at unassigned evidence in your breseq results, you will add mutations to the 02_curate_add GenomeDiff file, as described in the rest of this tutorial. Sometimes you will divide up a mutation predicted by breseq into several mutations that happen one after another such that, in the end, they create the same final change to the genome. Other times breseq might have a false-positive prediction of a mutation. These are the cases when you will need to copy a mutation line from your original GenomeDiff file to the one with the same name in 02_curate_remove.

The compare_normalized.html file is great for finding when later genomes have mutations that hide earlier mutations or that are described in a different way. If you entered TIME and CLONE metadate in your headers, the columns will be sorted by this information.

The directory 07_phylogeny/discrepancies contains Newick tree files for all mutations that don't agree with the overall phylogenetic tree.

Both files are helpful for noticing earlier mutations that were deleted in a later genome and need to be marked with deleted=1, mutations that were otherwise modified and need to be broken down into their constituent parts, and sometimes mutations that were just missed by breseq and can be found upon further examination of the sequencing data (for example with breseq BAM2COV, breseq BAM2ALN, or IGV).

Inescapably, you'll introduce some typos and other inconsistencies into your 02_curate_add and 02_curate_remvoe GenomeDiff files as you edit them that cause them to fail validation. You can quickly check for formatting problems in all of the file you are editing using this command:

gdtools VALIDATE 02*/*.gd

Test your curation!

Copy the files from 04_final_normalized_gd back to where you ran brefito into the genome-diffs folder in your main directory there. Now you can use brefito to automate checking your annotations using gdtools APPLY and re-running breseq against the mutant genomes.

You might get tired of copying files from your curation directory to where you are running brefito every cycle of curation. One solution is to create symbolic links from the mutant sequences generated by gdtools APPLY in the curation directory to where you run your brefito workflows. Let's call that location /complete/path/to/brefito in the example commands for doing this below.

mkdir /complete/path/to/brefito/mutants
cd mutated_genomes
../../bin/batch_run.pl -0 "ln -s $PWD/#d /complete/path/to/brefito/mutants/#d"

Caution

If you do this, be sure that you use ABSOLUTE versus RELATIVE paths for the symlinks. Snakemake will not recognize relative links as input files. It will be like they don't exist.