GenomeDiff File Format

Jeffrey Barrick edited this page Aug 11, 2024 · 3 revisions

The GenomeDiff file format describes mutational differences between a reference DNA sequence and a sample. It may also include evidence from computational analysis or experiments that supports mutations.

An example of a portion of a file:

#=GENOME_DIFF 1.0
DEL  61  11  NC_001416   139 1
INS  62  12  NC_001416   14266   G
SNP  63  13  NC_001416   20661   G
INS  64  14  NC_001416   20835   C
SNP  65  15  NC_001416   21714   A
DEL  60  33,1    NC_001416   21738   5996
SNP  66  35  NC_001416   31016   C
...
MC   9       NC_001416   1   2   0   0   left_inside_cov=0   left_outside_cov=NA right_inside_cov=0  right_outside_cov=169
RA   11      NC_001416   139 0   G   .   frequency=1 new_cov=34/40   quality=309.0   ref_cov=0/0 tot_cov=34/40
JC   2       NC_001416   5491    1   NC_001416   30255   1   0   alignment_overlap=4 coverage_minus=8    coverage_plus=0 flanking_left=35    flanking_right=35   key=NC_001416__5491__1__NC_001416__30251__1__4____35__35__0__0  max_left=30 max_left_minus=30   max_left_plus=0 max_min_left=0  max_min_left_minus=0    max_min_left_plus=0 max_min_right=11    max_min_right_minus=11  max_min_right_plus=0    max_right=11    max_right_minus=11  max_right_plus=0    min_overlap_score=44    pos_hash_score=7    reject=NJ,COV   side_1_annotate_key=gene    side_1_overlap=4    side_1_redundant=0  side_2_annotate_key=gene    side_2_overlap=0    side_2_redundant=0  total_non_overlap_reads=8   total_reads=8
JC   3       NC_001416   13180   1   NC_001416   13218   1   0   alignment_overlap=4 coverage_minus=1    coverage_plus=0 flanking_left=35    flanking_right=35   key=NC_001416__13180__1__NC_001416__13214__1__4____35__35__0__0 max_left=17 max_left_minus=17   max_left_plus=0 max_min_left=0  max_min_left_minus=0    max_min_left_plus=0 max_min_right=14    max_min_right_minus=14  max_min_right_plus=0    max_right=14    max_right_minus=14  max_right_plus=0    min_overlap_score=14    pos_hash_score=1    reject=NJ,COV   side_1_annotate_key=gene    side_1_overlap=4    side_1_redundant=0  side_2_annotate_key=gene    side_2_overlap=0    side_2_redundant=0  total_non_overlap_reads=1   total_reads=1
RA   12      NC_001416   14266   1   .   G   frequency=1 new_cov=44/31   quality=186.3   ref_cov=0/0 tot_cov=44/31
JC   5       NC_001416   14869   -1  NC_001416   15609   -1  0   alignment_overlap=7 coverage_minus=1    coverage_plus=0 flanking_left=35    flanking_right=35   key=NC_001416__14869__0__NC_001416__15616__0__7____35__35__0__0 max_left=21 max_left_minus=21   max_left_plus=0 max_min_left=0  max_min_left_minus=0    max_min_left_plus=0 max_min_right=7 max_min_right_minus=7   max_min_right_plus=0    max_right=7 max_right_minus=7   max_right_plus=0    min_overlap_score=7 pos_hash_score=1    reject=NJ,COV   side_1_annotate_key=gene    side_1_overlap=7    side_1_redundant=0  side_2_annotate_key=gene    side_2_overlap=0    side_2_redundant=0  total_non_overlap_reads=1   total_reads=1

Format specification

Version line

The first line of the file must identify it as a GenomeDiff file and state the version of the file specification used:

#=GENOME_DIFF 1.0

Only version 1.0 is defined.

Metadata lines

Lines beginning with #=<name> <value> are interpreted as metadata. (Thus, the first line is assigning a metadata item named GENOME_DIFF a value of 1.0.) Names cannot include whitespace characters. Values may include whitespace characters. In most cases, the values for lines with the same name are concatenated with single spaces added between them or interpreted as a list.

Common but optional metadata fields include:

TITLE
The name of the sample. If this field is not provided, the name of the file (without the .gd suffix) is used instead.

AUTHOR
Name of the person who curated the file.

PROGRAM
Name and version of software program that generated the file.

CREATED
Date on which the file was created.

TIME
Time point the sample is from, in days, generations, or any other unit of measurement. Ex: 1, 2, 15000

POPULATION
Name/designation for the population the sample is from. Ex: Ara–3 / MA-1

TREATMENT
Experimental treatment group for this population. Ex: LB medium / LTEE

CLONE
Name/designation for a clonal isolate. Ex: A, B, REL10863

REFSEQ
Location of the reference sequence file. Ex: /here/is/an/absolute/path/to/the/file.gb

ADAPTSEQ
Location of the adaptor sequence file. Ex: relative/path/to/the/adaptors.fa

READSEQ
Location of the read sequence file. Ex: https://place.org/url/for/file/download.fastq

REFSEQ, ADAPTSEQ, and READSEQ entries are interpreted as lists of multiple files to process/include if they appear multiple times. The location value for these fields can be a URL or an absolute/relative file path.

Some of these metadata fields are used to name and sort samples by gdtools COMPARE and by other utilities.
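The metadata rules above can be sketched in a few lines of Python. This is only an illustration, not how breseq or gdtools read files: the function names `parse_metadata` and `sample_title` are hypothetical, and repeated names are simply kept as lists so the caller can either join them with single spaces or treat them as a list.

```python
from collections import defaultdict

def parse_metadata(lines):
    """Collect '#=<name> <value>' lines into a dict mapping name -> list of values."""
    metadata = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#="):
            continue  # comment line or data line
        # Names cannot contain whitespace; the rest of the line is the value.
        parts = line[2:].split(None, 1)
        if not parts:
            continue
        value = parts[1].strip() if len(parts) > 1 else ""
        metadata[parts[0]].append(value)
    return dict(metadata)

def sample_title(metadata, filename):
    """Return TITLE (values joined by single spaces), else the file name without '.gd'."""
    if metadata.get("TITLE"):
        return " ".join(metadata["TITLE"])
    return filename[:-3] if filename.endswith(".gd") else filename
```

For example, two `#=REFSEQ` lines end up as a two-item list under the `REFSEQ` key, matching the list interpretation described above.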

Comment lines

Lines beginning with whitespace and # are comments. Comments may not occur at the end of a data or metadata line. They must be on a line by themselves.

Data lines

Data lines describe either a mutation or evidence from an analysis that can potentially support a mutational event. Data fields are tab-delimited. Each line begins with several fields containing information common to all types, continues with a fixed number of type-specific fields, and ends with an arbitrary number of name=value pairs that store optional information.

  1. type <string>

    type of the entry on this line.

  2. id or evidence-id <uint32>

    For evidence and validation lines, the id of this item. For mutation lines, the ids of all evidence or validation items that support this mutation. May be set to '.' if a line was manually edited.

  3. parent-ids <uint32>

    ids of evidence that support this mutation. May be set to '.' or left blank.

mutation types are 3 letters: SNP, SUB, DEL, INS, MOB, AMP, CON, INT, INV.

evidence types are 2 letters: RA, MC, JC, UN.

validation types are 4 letters: TSEQ, PFLP, RFLP, PFGE, PHYL, CURA.
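As a rough sketch of the data line layout described above, the following Python splits one tab-delimited line into its common fields, its type-specific fields, and its name=value pairs. The function names are hypothetical, and the number of type-specific fields is passed in by the caller (for example 3 for SNP or DEL, following the field lists in this specification) rather than looked up.

```python
def classify_type(entry_type):
    """Classify an entry by the length of its type code (3, 2, or 4 letters)."""
    return {3: "mutation", 2: "evidence", 4: "validation"}.get(len(entry_type), "unknown")

def split_data_line(line, n_type_fields):
    """Split one tab-delimited data line into its parts.

    n_type_fields is the number of type-specific positional fields
    for this entry type, supplied by the caller.
    """
    fields = line.rstrip("\n").split("\t")
    parent_field = fields[2]
    named = {}
    for item in fields[3 + n_type_fields:]:
        if not item:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        name, _, value = item.partition("=")
        named[name] = value
    return {
        "type": fields[0],
        "class": classify_type(fields[0]),
        "id": fields[1],
        # '.' or a blank third field means no ids are listed.
        "parent_ids": [] if parent_field in ("", ".") else parent_field.split(","),
        "positional": fields[3:3 + n_type_fields],
        "named": named,
    }
```

Values are left as strings here; converting positions and sizes to integers would be a natural next step for a real parser.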

Mutational Event Types

SNP: Base substitution mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment of base to replace.

  3. new_seq <char>

    new base at position.

SUB: Multiple base substitution mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in the reference sequence of the first base that will be replaced.

  3. size <uint32>

    number of bases after the specified reference position to replace with new_seq.

  4. new_seq <string>

    new bases to substitute.

DEL: Deletion mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment of first deleted base.

  3. size <uint32>

    number of bases deleted in reference.

Additional DEL named fields

  • mediated=<mobile_element_family>
    This deletion appears to be mediated by a molecular event involving a mobile element such as a transposon. A copy of the mobile element is found on the boundary of the deleted region and a new junction at the opposite end of the deletion matches the end of the mobile element.

  • between=<repeat_family>
    This deletion appears to result from homologous recombination or polymerase slipping between two existing copies of the same genomic repeat (e.g. tRNA, IS element) in the genome. One copy of the repeat is deleted by this event.

  • repeat_seq=<string>, repeat_length=<uint32>, repeat_ref_num=<uint32>, repeat_new_copies=<uint32>
    This deletion is in a short sequence repeat consisting of tandem copies of repeat_seq repeated repeat_ref_num times in the ancestor and repeat_new_copies after a mutation. To be annotated in this way the copy of the repeat in the reference genome must consist of at least two repeat copies and have a length of five or more total bases (repeat_length × repeat_ref_num ≥ 5).
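The repeat annotation condition can be checked with a small helper. The sketch below is an assumption about how one might test it, not the breseq implementation; `count_tandem_copies` and `qualifies_as_repeat` are hypothetical names, and `start` is a 0-based Python index rather than a GenomeDiff position.

```python
def count_tandem_copies(seq, start, unit):
    """Count consecutive copies of unit in seq, beginning at 0-based index start."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    copies = 0
    while seq.startswith(unit, start + copies * len(unit)):
        copies += 1
    return copies

def qualifies_as_repeat(repeat_length, repeat_ref_num):
    """At least two copies in the reference and five or more total bases."""
    return repeat_ref_num >= 2 and repeat_length * repeat_ref_num >= 5
```

With these rules a run of three AC units qualifies (2 × 3 = 6 bases), while four copies of a single base do not (1 × 4 = 4 bases).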

INS: Insertion mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment. New bases are inserted after this position.

  3. new_seq <string>

    new bases to be inserted in the reference.

Additional INS named fields

  • repeat_seq=<string>, repeat_length=<uint32>, repeat_ref_num=<uint32>, repeat_new_copies=<uint32>
    This insertion is in a short sequence repeat consisting of tandem copies of repeat_seq repeated repeat_ref_num times in the ancestor and repeat_new_copies after a mutation. To be annotated in this way the copy of the repeat in the reference genome must consist of at least two repeat copies and have a length of five or more total bases (repeat_length × repeat_ref_num ≥ 5).

  • insert_position=<uint32>
    Used when there are multiple insertion events after the same reference base to order the insertions. This typically happens in polymorphism mode and when manually breaking up an insertion of bases into distinct mutational events when this is supported by phylogenetic information. Numbering of insert positions begins with 1.
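To make the coordinate conventions of SNP, SUB, DEL, and INS concrete, here is a minimal Python sketch that applies a single mutation to a reference held as a string. It assumes positions are 1-based, and the function names are hypothetical. It is not how gdtools APPLY is implemented.

```python
def apply_snp(ref, position, new_base):
    """Replace the base at the 1-based position with new_base."""
    i = position - 1
    return ref[:i] + new_base + ref[i + 1:]

def apply_sub(ref, position, size, new_seq):
    """Replace size bases, starting at position, with new_seq."""
    i = position - 1
    return ref[:i] + new_seq + ref[i + size:]

def apply_del(ref, position, size):
    """Delete size bases; position is the first deleted base."""
    i = position - 1
    return ref[:i] + ref[i + size:]

def apply_ins(ref, position, new_seq):
    """Insert new_seq after the base at position."""
    return ref[:position] + new_seq + ref[position:]
```

Each function handles one mutation in isolation. Applying several mutations to the same sequence would also require tracking how earlier changes shift later coordinates, which this sketch does not attempt.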

MOB: Mobile element insertion mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment of the first duplicated base at the target site.

  3. repeat_name <string>

    name of the mobile element. Should correspond to an annotated repeat_region or mobile_element feature in the reference sequence.

  4. strand <1/-1>

    strand of mobile element insertion.

  5. duplication_size <int32>

    number of target site bases duplicated during insertion of the mobile element, beginning with the specified reference position. If the value of this field is negative, then it indicates that the absolute value of this number of bases were deleted at the target site beginning with the specified position. If the value of this field is zero, then there were no duplicated bases, and the mobile element was inserted after the specified base position.

Additional MOB named fields

  • del_start=<uint32>, del_end=<uint32>
    Delete this many bases from the start or end of the inserted mobile element. This deletion occurs with respect to the top strand of the genome after the element is flipped to the orientation with which it will be inserted.

  • ins_start=<string>, ins_end=<string>
    Append the specified bases to the start or end of the inserted mobile element. These insertions occur after any deletions and will be inside of any duplicated target site bases.

  • mob_region=<seq_id:start-end>
    Use the existing copy of the mobile element specified as a seq_id:start-end region to apply this mutation. Useful when different annotated members of a mobile element family have slightly different sequences.
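The interplay of strand, duplication_size, and the del_/ins_ fields is easier to follow in code. The Python below is one possible reading of the MOB description, offered as a sketch only: it assumes 1-based positions, takes the element sequence as a plain string supplied by the caller, and uses hypothetical function names. The real gdtools behavior may differ in edge cases.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def apply_mob(ref, position, element, strand, duplication_size,
              del_start=0, del_end=0, ins_start="", ins_end=""):
    """Insert a mobile element at a 1-based position (assumed interpretation)."""
    # Flip the element into the orientation in which it is inserted.
    inserted = element if strand == 1 else reverse_complement(element)
    # Deletions from the element ends come first, then appended bases.
    inserted = inserted[del_start:len(inserted) - del_end]
    inserted = ins_start + inserted + ins_end
    i = position - 1
    if duplication_size > 0:
        # Target site bases end up on both sides of the element.
        return ref[:i + duplication_size] + inserted + ref[i:]
    if duplication_size < 0:
        # Negative value: that many target site bases are deleted instead.
        return ref[:i] + inserted + ref[i - duplication_size:]
    # Zero: no duplication, element goes after the specified base.
    return ref[:position] + inserted + ref[position:]
```

With a positive duplication_size the appended ins_start and ins_end bases sit inside the duplicated target site copies, as the named field description requires.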

AMP: Amplification mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment.

  3. size <uint32>

    number of bases duplicated starting with the specified reference position.

  4. new_copy_number <uint32>

    new number of copies of specified bases.

Additional AMP named fields

  • between=<repeat_family>
    This amplification appears to result from homologous recombination or polymerase slipping between two existing copies of the same genomic repeat (e.g. tRNA, IS element) in the genome. This repeat appears on the boundary of each copy of the specified region.

  • mediated=<repeat_family>, mediated_strand=<1/-1>
    This amplification is mediated by a simultaneous new insertion of a mobile element (or other repeat element). New copies of the inserted element are added in the specified strand orientation between each new copy of the amplified region. Both of these attributes must be specified for the mutation.

  • mob_region=<seq_id:start-end>
    Only valid for 'mediated' amplifications. Use the existing copy of the mobile element specified as a seq_id:start-end region to apply this mutation. Useful when different annotated members of a mobile element family have slightly different sequences.
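An amplification can be sketched as replacing the specified region with new_copy_number tandem copies of itself. The Python below assumes 1-based positions and a hypothetical function name. The optional `between_seq` argument is an illustrative stand-in for a 'mediated' amplification: the caller passes the already oriented element sequence, and it is placed between adjacent copies.

```python
def apply_amp(ref, position, size, new_copy_number, between_seq=""):
    """Replace ref[position .. position+size-1] with new_copy_number copies.

    between_seq (hypothetical parameter) is inserted between adjacent
    copies, as a stand-in for a mediating element.
    """
    i = position - 1
    unit = ref[i:i + size]
    amplified = between_seq.join([unit] * new_copy_number)
    return ref[:i] + amplified + ref[i + size:]
```

Note that new_copy_number is the final number of copies, so a value of 2 corresponds to a simple duplication.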

CON: Gene conversion mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment that was the target of gene conversion from another genomic location.

  3. size <uint32>

    number of bases to replace in the reference genome beginning at the specified position.

  4. region <sequence:start-end>

    Region in the reference genome to use as a replacement.

INT: Integration mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment that was the target of integration from another genomic location.

  3. size <uint32>

    number of bases to replace in the reference genome beginning at the specified position.

  4. region <sequence:start-end>

    Region in the reference genome to use as a replacement.

What is the difference between a CON and an INT?

Gene conversions generally don't add or remove genes from a genome. They just exchange a few bases between homologous genes. Therefore, gene annotations are not changed by applying a CON mutation to a genome. On the other hand, integration implies that new genes are being added to a sequence. Thus, gene annotations are copied over to the inserted bases when applying an INT mutation to a genome.
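The sequence change shared by CON and INT can be sketched as follows. This Python models only the replacement of bases, not the annotation handling that distinguishes the two types. It assumes 1-based inclusive coordinates and a region with start ≤ end, and the function names and the dict of sequences are hypothetical.

```python
def parse_region(region):
    """Split 'seq_id:start-end' into (seq_id, start, end)."""
    seq_id, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return seq_id, int(start), int(end)

def apply_con(sequences, seq_id, position, size, region):
    """Replace size bases at position with the bases from region.

    sequences maps each seq_id to its sequence string.
    Assumes start <= end in the region.
    """
    src_id, start, end = parse_region(region)
    replacement = sequences[src_id][start - 1:end]
    target = sequences[seq_id]
    i = position - 1
    return target[:i] + replacement + target[i + size:]
```

Splitting on the last colon keeps the sketch usable with seq_id values that themselves contain a colon.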

INV: Inversion mutation

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment.

  3. size <uint32>

    number of bases in inverted region beginning at the specified reference position.
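A minimal sketch of applying an inversion, assuming 1-based positions and that "inverted" means the region is replaced by its reverse complement on the top strand (the usual biological meaning, though the specification does not spell it out). The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def apply_inv(ref, position, size):
    """Invert size bases beginning at the 1-based position."""
    i = position - 1
    return ref[:i] + reverse_complement(ref[i:i + size]) + ref[i + size:]
```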

Standard name=value pairs

Counting Mutations

These attributes affect how molecular events in a GenomeDiff are counted for summary purposes. They can be properties of any mutation entry.

  • adjacent=<repeat_family>

    This mutation is adjacent to the specified element. For example, it may be an insertion of a base next to a mobile element. One may want to ignore mutations in this category for certain analyses because they may represent hotspots with atypical mutation rates.

  • with=<mutation_id>

    This mutation should not be counted separately. It should be counted as a single molecular event with the other specified mutation (which does not need a with tag).
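One simple way to honor these two attributes when tallying events is shown below. It is a sketch under the assumption that each mutation is represented as a dict of its named fields. The function name and the `skip_adjacent` option are hypothetical and are not gdtools features.

```python
def count_events(mutations, skip_adjacent=False):
    """Count molecular events in a list of mutation dicts.

    A mutation carrying a 'with' tag is folded into the mutation it
    points to, so it is not counted on its own. Mutations tagged
    'adjacent' are optionally left out.
    """
    count = 0
    for mutation in mutations:
        if "with" in mutation:
            continue
        if skip_adjacent and "adjacent" in mutation:
            continue
        count += 1
    return count
```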

Applying Mutations

These advanced attributes control how mutations are applied when using gdtools APPLY to build a new reference genome from the original reference genome and a GenomeDiff and when building phylogenetic trees from multiple samples. They are not generated automatically by breseq.

  • before=<mutation_id>

    Apply this mutation before another mutation. For example, did a base substitution occur after a region was duplicated, thus it is only in one copy or did it occur before the duplication, thus altering both copies? Did a base substitution happen before a deletion, hiding a mutation that should be included in any phylogenetic inference? When this attribute is absent, mutations are applied in order according to their genomic positions.

  • within=<mutation_id> or within=<mutation_id>:<insert_position> or within=<mutation_id>:<copy_index>

    This mutation happens within a different mutation. These options can specify, for example, that a base substitution happens in the second copy of a duplicated region. If <mutation_id> refers to an AMP, then it must be of the form <mutation_id>:<copy_index> and the mutation is placed in the corresponding copy of the specified coordinates. If <mutation_id> refers to an INS, then it must be of the form <mutation_id>:<insert_position>, and the <insert_position> of the mutation will be interpreted as happening within the INS bases that are inserted, so that it can change those new bases. This coordinate is local to the new bases, so an <insert_position> of 1 refers to the first inserted base. In this case, the main <position> of the mutation must be the same as the <position> of the INS that it is within. If <mutation_id> refers to a MOB with no <copy_index>, then the mutation is placed within the newly inserted sequence of the mobile element with the position of the mutation interpreted as happening on the new genome after the MOB bases are inserted. If it refers to a MOB and is of the form <mutation_id>:<copy_index>, then the mutation is placed within the specified copy of the target site duplication.

  • deleted=1

    The sequence change caused by this mutation was made irrelevant by subsequent mutations that deleted or further changed the affected region. Annotation of this mutation in the given genome was inferred based on phylogeny. It will not be applied when generating the mutated genome.

  • apply_size_adjust=<int32>

    When applying the mutation, change its size by this number. Usually used for DEL mutations and complicated cases. See the Common Curation Cases tutorial for an example.

Evidence Types

RA: Read alignment evidence

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment.

  3. insert_position <uint32>

    number of bases inserted after the reference position to get to this base. A value of zero refers to the reference base itself. A value of 5 means that this evidence is for the fifth newly inserted column after the reference position.

  4. ref_base <char>

    base in the reference genome.

  5. new_base <char>

    new base supported by read alignment evidence.

MC: Missing coverage evidence

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. start <uint32>

    start position in reference sequence fragment.

  3. end <uint32>

    end position in reference sequence of region.

  4. start_range <uint32>

    number of bases to offset after the start position to define the upper limit of the range where the start of a deletion could be.

  5. end_range <uint32>

    number of bases to offset before the end position to define the lower limit of the range where the end of a deletion could be.

Essentially this is evidence of missing coverage between two positions in the ranges [start, start+start_range] and [end-end_range, end].
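The two boundary ranges follow directly from the four numeric MC fields. A small Python helper makes the arithmetic explicit; the function name `mc_ranges` is hypothetical.

```python
def mc_ranges(start, end, start_range, end_range):
    """Return the inclusive ranges in which each deletion boundary may lie.

    First tuple: where the missing coverage may start.
    Second tuple: where the missing coverage may end.
    """
    return (start, start + start_range), (end - end_range, end)
```

For the MC line in the example at the top of this page (start 1, end 2, both ranges 0), each boundary is pinned to a single position.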

JC: New junction evidence

  1. side_1_seq_id <string>

    id of reference sequence fragment containing side 1 of the junction.

  2. side_1_position <uint32>

    position of side 1 at the junction boundary.

  3. side_1_strand <1/-1>

    direction that side 1 continues matching the reference sequence.

  4. side_2_seq_id <string>

    id of reference sequence fragment containing side 2 of the junction.

  5. side_2_position <uint32>

    position of side 2 at the junction boundary.

  6. side_2_strand <1/-1>

    direction that side 2 continues matching the reference sequence.

  7. overlap <uint32>

    Number of bases that the two sides of the new junction have in common.

UN: Unknown base evidence

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. start <uint32>

    start position in reference sequence of region.

  3. end <uint32>

    end position in reference sequence of region.

Validation Types

These items indicate that mutations have been validated by further, targeted experiments.

CURA: True-positive curated by an expert

An expert has examined the data output from a prediction program and determined that this mutation is a true positive.

Line specification:

  1. expert <string>

    Name or initials of the person who predicted the mutation.

FPOS: False-positive curated by an expert

An expert has examined the raw read data and determined that this predicted mutation is a false positive.

Line specification:

  1. expert <string>

    Name or initials of the person who predicted the mutation.

PHYL: Phylogenetic comparison

This validation was transferred from validation in another, related genome.

Line specification:

  1. gd <string>

    Name of the genome_diff file containing the evidence.

TSEQ: Targeted re-sequencing

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. primer1_start <uint32>

    position in reference sequence of the 5' end of primer 1.

  3. primer1_end <uint32>

    position in reference sequence of the 3' end of primer 1.

  4. primer2_start <uint32>

    position in reference sequence of the 5' end of primer 2.

  5. primer2_end <uint32>

    position in reference sequence of the 3' end of primer 2.

For primer 1, start < end. For primer 2, end < start.
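The primer orientation rule (shared by TSEQ, PFLP, and RFLP lines) can be validated with a short check. This is an illustrative sketch and `check_primers` is a hypothetical name, not part of any breseq utility.

```python
def check_primers(primer1_start, primer1_end, primer2_start, primer2_end):
    """Return a list of problems with the primer coordinates (empty if valid)."""
    problems = []
    # Primer 1 reads forward along the reference: 5' end before 3' end.
    if not primer1_start < primer1_end:
        problems.append("primer 1 must have start < end")
    # Primer 2 reads in the opposite direction: 3' end before 5' end.
    if not primer2_end < primer2_start:
        problems.append("primer 2 must have end < start")
    return problems
```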

PFLP: PCR-fragment length polymorphism

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. primer1_start <uint32>

    position in reference sequence of the 5' end of primer 1.

  3. primer1_end <uint32>

    position in reference sequence of the 3' end of primer 1.

  4. primer2_start <uint32>

    position in reference sequence of the 5' end of primer 2.

  5. primer2_end <uint32>

    position in reference sequence of the 3' end of primer 2.

For primer 1, start < end. For primer 2, end < start.

RFLP: Restriction fragment length polymorphism

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. primer1_start <uint32>

    position in reference sequence of the 5' end of primer 1.

  3. primer1_end <uint32>

    position in reference sequence of the 3' end of primer 1.

  4. primer2_start <uint32>

    position in reference sequence of the 5' end of primer 2.

  5. primer2_end <uint32>

    position in reference sequence of the 3' end of primer 2.

  6. enzyme <string>

    Restriction enzyme used to distinguish reference from mutated allele.

For primer 1, start < end. For primer 2, end < start.

PFGE: Pulsed-field gel electrophoresis

Changes in fragment sizes of genomic DNA digested with restriction enzymes and separated by pulsed-field gel electrophoresis.

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. restriction enzyme <string>

    Restriction enzyme used to digest genomic DNA and observe fragments.

NOTE: Note

Generic container for a note about a mutation prediction

Line specification:

  1. note <string>

    Free text note.

MASK: Repeat mask a section

Artificially mask a section of DNA as "N"s. This is useful for creating modified reference sequences, particularly for targeted sequencing approaches.

Line specification:

  1. seq_id <string>

    id of reference sequence fragment containing mutation, evidence, or validation.

  2. position <uint32>

    position in reference sequence fragment.

  3. size <uint32>

    number of bases masked to "N" in reference, including reference position.
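Masking is straightforward to sketch in Python, assuming 1-based positions. The function name `apply_mask` is hypothetical, and clipping a mask that runs past the end of the sequence is a choice made for this sketch rather than documented behavior.

```python
def apply_mask(ref, position, size):
    """Replace size bases, starting at the 1-based position, with 'N'."""
    i = position - 1
    # Only mask bases that actually exist, so the length is preserved.
    masked_length = len(ref[i:i + size])
    return ref[:i] + "N" * masked_length + ref[i + size:]
```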
