Reorganize Cram test files
The current filter.t is pretty cumbersome to work with.
See slack thread: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1653072787820709

This change breaks that file into several smaller files.

- Split filter.t into much smaller files
    - Most files contain only setup plus one augur filter command.
    - A few files also check the output or run related commands.
- Re-organize supporting files so that everything is under tests/functional/filter/, which has two folders:
    - cram: individual test files
    - data: supporting data (previously directly under tests/functional/filter/)
victorlin committed May 24, 2022
1 parent 99c4d05 commit 5ce415f
Showing 36 changed files with 615 additions and 514 deletions.
513 changes: 0 additions & 513 deletions tests/functional/filter.t

This file was deleted.

2 changes: 2 additions & 0 deletions tests/functional/filter/cram/_setup.sh
@@ -0,0 +1,2 @@
pushd "$TESTDIR/../../" > /dev/null
export AUGUR="${AUGUR:-../../bin/augur}"
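The `${AUGUR:-../../bin/augur}` expansion in _setup.sh keeps any AUGUR value the caller has already set and otherwise falls back to the repository's own bin/augur. A minimal sketch of that behavior; the override path /opt/custom/augur is hypothetical and not part of this commit:

```shell
# Case 1: AUGUR is unset, so the default path is used.
unset AUGUR
default_case="${AUGUR:-../../bin/augur}"
echo "$default_case"

# Case 2: the caller pre-sets AUGUR (hypothetical path), so that value wins.
AUGUR="/opt/custom/augur"
override_case="${AUGUR:-../../bin/augur}"
echo "$override_case"
```

This is what lets the same test files run against either the in-repo script or a separately installed augur.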
17 changes: 17 additions & 0 deletions tests/functional/filter/cram/filter-exclude-include.t
@@ -0,0 +1,17 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Filter with an exclude query for three regions that together comprise all but one strain.
This filter should leave a single record from Oceania.
Force include one South American record by country to get two total records.

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --exclude-where "region=South America" "region=North America" "region=Southeast Asia" \
> --include-where "country=Ecuador" \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
$ wc -l "$TMP/filtered_strains.txt"
\s*2 .* (re)
$ rm -f "$TMP/filtered_strains.txt"
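The `\s*2 .* (re)` expectation tolerates platform differences in `wc -l` output: some implementations left-pad the count with spaces, and the file name follows the count. A rough shell equivalent of that match, using a throwaway file with placeholder strain names and a POSIX character class in place of `\s`:

```shell
# Write a two-line file of placeholder strain names.
tmpfile="$(mktemp)"
printf 'strain_a\nstrain_b\n' > "$tmpfile"

# wc -l prints "<count> <filename>", sometimes with leading padding.
wc_output="$(wc -l "$tmpfile")"

# Accept optional leading whitespace, the count 2, a space, then anything.
if printf '%s\n' "$wc_output" | grep -Eq '^[[:space:]]*2 .*$'; then
  match_result="matched"
else
  match_result="no match"
fi
echo "$match_result"
rm -f "$tmpfile"
```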
42 changes: 42 additions & 0 deletions tests/functional/filter/cram/filter-metadata-duplicates-error.t
@@ -0,0 +1,42 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Error on duplicates in metadata within same chunk.

$ cat >$TMP/metadata-duplicates.tsv <<~~
> strain date
> a 2010-10-10
> a 2010-10-10
> b 2010-10-10
> c 2010-10-10
> d 2010-10-10
> ~~
$ ${AUGUR} filter \
> --metadata $TMP/metadata-duplicates.tsv \
> --group-by year \
> --sequences-per-group 2 \
> --subsample-seed 0 \
> --metadata-chunk-size 10 \
> --output-metadata $TMP/metadata-filtered.tsv > /dev/null
ERROR: Duplicate found in .* (re)
[2]
$ cat $TMP/metadata-filtered.tsv
cat: .*: No such file or directory (re)
[1]

Error on duplicates in metadata in separate chunks.

$ ${AUGUR} filter \
> --metadata $TMP/metadata-duplicates.tsv \
> --group-by year \
> --sequences-per-group 2 \
> --subsample-seed 0 \
> --metadata-chunk-size 1 \
> --output-metadata $TMP/metadata-filtered.tsv > /dev/null
ERROR: Duplicate found in .* (re)
[2]
$ cat $TMP/metadata-filtered.tsv
cat: .*: No such file or directory (re)
[1]
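Both runs fail on the repeated strain `a`, whether the duplicate rows fall inside one chunk or across chunks. As a rough pre-flight check outside of Augur (this is not how augur filter detects duplicates internally, only a sketch that assumes a tab-delimited file whose first column is the strain name):

```shell
# Build a small tab-delimited metadata file with one repeated strain.
tmpfile="$(mktemp)"
printf 'strain\tdate\na\t2010-10-10\na\t2010-10-10\nb\t2010-10-10\n' > "$tmpfile"

# Skip the header (NR > 1) and print a strain the second time it is seen.
duplicates="$(awk -F '\t' 'NR > 1 && seen[$1]++ == 1 { print $1 }' "$tmpfile")"
echo "$duplicates"
rm -f "$tmpfile"
```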
14 changes: 14 additions & 0 deletions tests/functional/filter/cram/filter-metadata-not-found-error.t
@@ -0,0 +1,14 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Try to filter on a metadata file that does not exist.

$ ${AUGUR} filter \
> --metadata file-does-not-exist.tsv \
> --group-by year month \
> --sequences-per-group 1 \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
ERROR: No such file or directory: 'file-does-not-exist.tsv'
[2]
@@ -0,0 +1,29 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Confirm that filtering omits strains without metadata or sequences.
The input sequences are missing one strain that is in the metadata.
The metadata are missing one strain that has a sequence.
The list of strains to include has one strain with no metadata/sequence and one strain with information that would have been filtered by country.
The query initially filters 3 strains from Colombia, one of which is added back by the include.

$ ${AUGUR} filter \
> --sequence-index filter/data/sequence_index.tsv \
> --metadata filter/data/metadata.tsv \
> --query "country != 'Colombia'" \
> --non-nucleotide \
> --exclude-ambiguous-dates-by year \
> --include filter/data/include.txt \
> --output-strains "$TMP/filtered_strains.txt" \
> --output-log "$TMP/filtered_log.tsv"
4 strains were dropped during filtering
\t1 had no metadata (esc)
\t1 had no sequence data (esc)
\t3 of these were filtered out by the query: "country != 'Colombia'" (esc)
\t1 strains were added back because they were in filter/data/include.txt (esc)
9 strains passed all filters

$ diff -u <(sort -k 1,1 filter/data/filtered_log.tsv) <(sort -k 1,1 "$TMP/filtered_log.tsv")
$ rm -f "$TMP/filtered_strains.txt"
13 changes: 13 additions & 0 deletions tests/functional/filter/cram/filter-min-date.t
@@ -0,0 +1,13 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Filter using only metadata without a sequence index.
This should work because the requested filters don't rely on sequence information.

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --min-date 2012 \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
$ rm -f "$TMP/filtered_strains.txt"
@@ -0,0 +1,14 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Try to filter using only metadata without a sequence index.
This should fail because the requested filters rely on sequence information.

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --min-length 10000 \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
ERROR: You need to provide a sequence index or sequences to filter on sequence-specific information.
[1]
19 changes: 19 additions & 0 deletions tests/functional/filter/cram/filter-min-length-output-metadata.t
@@ -0,0 +1,19 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Filter using metadata and a sequence index, without sequence input or output, and save results as filtered metadata.

$ ${AUGUR} filter \
> --sequence-index filter/data/sequence_index.tsv \
> --metadata filter/data/metadata.tsv \
> --min-date 2012 \
> --min-length 10500 \
> --output-metadata "$TMP/filtered_metadata.tsv" > /dev/null

Output should include the 8 sequences matching the filters and a header line.

$ wc -l "$TMP/filtered_metadata.tsv"
\s*9 .* (re)
$ rm -f "$TMP/filtered_metadata.tsv"
19 changes: 19 additions & 0 deletions tests/functional/filter/cram/filter-min-length-output-strains.t
@@ -0,0 +1,19 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Filter using metadata and a sequence index, and save results as a list of filtered strains.

$ ${AUGUR} filter \
> --sequence-index filter/data/sequence_index.tsv \
> --metadata filter/data/metadata.tsv \
> --min-date 2012 \
> --min-length 10500 \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null

Output should include only the 8 sequences matching the filters (without a header line).

$ wc -l "$TMP/filtered_strains.txt"
\s*8 .* (re)
$ rm -f "$TMP/filtered_strains.txt"
16 changes: 16 additions & 0 deletions tests/functional/filter/cram/filter-min-max-date-output.t
@@ -0,0 +1,16 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Check output of min/max date filters.

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --min-date 2015-01-01 \
> --max-date 2016-02-01 \
> --output-metadata "$TMP/filtered_metadata.tsv"
8 strains were dropped during filtering
\t1 of these were dropped because they were earlier than 2015.0 or missing a date (esc)
\t7 of these were dropped because they were later than 2016.09 or missing a date (esc)
4 strains passed all filters
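The log reports --max-date 2016-02-01 as 2016.09 because dates are compared as decimal years. One convention that reproduces the reported value is year + (day_of_year - 0.5) / days_in_year; that formula is an assumption here and is not confirmed by this diff. A worked check in shell:

```shell
# Assumed convention (not confirmed by this diff):
#   numeric date = year + (day_of_year - 0.5) / days_in_year
year=2016
day_of_year=32      # 2016-02-01 is the 32nd day of the year
days_in_year=366    # 2016 is a leap year

# LC_ALL=C keeps the decimal separator a period regardless of locale.
numeric_date="$(LC_ALL=C awk -v y="$year" -v d="$day_of_year" -v n="$days_in_year" \
  'BEGIN { printf "%.2f", y + (d - 0.5) / n }')"
echo "$numeric_date"
```

Under the same assumed formula, 2015-01-01 evaluates to roughly 2015.0014, which rounds to the 2015.0 shown for --min-date.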
53 changes: 53 additions & 0 deletions tests/functional/filter/cram/filter-mismatched-sequences-error.t
@@ -0,0 +1,53 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Try to filter with sequences that don't match any of the metadata.
This should produce no results because the intersection of metadata and sequences is empty.

$ echo -e ">something\nATCG" > "$TMP/dummy.fasta"
$ ${AUGUR} filter \
> --sequences "$TMP/dummy.fasta" \
> --metadata filter/data/metadata.tsv \
> --min-length 4 \
> --max-date 2020-01-30 \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
ERROR: All samples have been dropped! Check filter rules and metadata file format.
[1]
$ wc -l "$TMP/filtered_strains.txt"
\s*0 .* (re)
$ rm -f "$TMP/filtered_strains.txt"

Repeat with sequence and strain outputs. We should get the same results.

$ ${AUGUR} filter \
> --sequences "$TMP/dummy.fasta" \
> --metadata filter/data/metadata.tsv \
> --max-date 2020-01-30 \
> --output-strains "$TMP/filtered_strains.txt" \
> --output-sequences "$TMP/filtered.fasta" > /dev/null
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
ERROR: All samples have been dropped! Check filter rules and metadata file format.
[1]
$ wc -l "$TMP/filtered_strains.txt"
\s*0 .* (re)
$ grep "^>" "$TMP/filtered.fasta" | wc -l
\s*0 (re)
$ rm -f "$TMP/filtered_strains.txt"
$ rm -f "$TMP/filtered.fasta"

Repeat without any sequence-based filters.
Since we expect metadata to be filtered by presence of strains in input sequences, this should produce no results because the intersection of metadata and sequences is empty.

$ ${AUGUR} filter \
> --sequences "$TMP/dummy.fasta" \
> --metadata filter/data/metadata.tsv \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
ERROR: All samples have been dropped! Check filter rules and metadata file format.
[1]
$ wc -l "$TMP/filtered_strains.txt"
\s*0 .* (re)
$ rm -f "$TMP/filtered_strains.txt"
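All three runs fail for the same reason: no strain name appears in both the metadata and the sequences. A sketch of that intersection using comm on sorted name lists; the metadata strain names are placeholders, and only `something` mirrors the dummy FASTA used in the test:

```shell
workdir="$(mktemp -d)"

# Placeholder metadata strain names, sorted for comm.
printf 'strain_a\nstrain_b\n' | sort > "$workdir/metadata_names.txt"

# Dummy FASTA with a single record, as in the test.
printf '>something\nATCG\n' > "$workdir/dummy.fasta"

# Strip the leading ">" from FASTA headers to get sequence names.
grep '^>' "$workdir/dummy.fasta" | sed 's/^>//' | sort > "$workdir/sequence_names.txt"

# comm -12 prints only the lines present in both sorted files.
shared_count="$(comm -12 "$workdir/metadata_names.txt" "$workdir/sequence_names.txt" \
  | wc -l | tr -d '[:space:]')"
echo "$shared_count"
rm -rf "$workdir"
```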
13 changes: 13 additions & 0 deletions tests/functional/filter/cram/filter-no-outputs-error.t
@@ -0,0 +1,13 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Try to filter without any outputs.

$ ${AUGUR} filter \
> --sequence-index filter/data/sequence_index.tsv \
> --metadata filter/data/metadata.tsv \
> --min-length 10000 > /dev/null
ERROR: You need to select at least one output.
[1]
@@ -0,0 +1,14 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Try to output to a directory that does not exist.

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --group-by year month \
> --sequences-per-group 1 \
> --output-strains "directory-does-not-exist/filtered_strains.txt" > /dev/null
ERROR: No such file or directory: 'directory-does-not-exist/filtered_strains.txt'
[2]
@@ -0,0 +1,15 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Try to filter with sequence outputs and no sequence inputs.
This should fail.

$ ${AUGUR} filter \
> --sequence-index filter/data/sequence_index.tsv \
> --metadata filter/data/metadata.tsv \
> --min-length 10000 \
> --output "$TMP/filtered.fasta" > /dev/null
ERROR: You need to provide sequences to output sequences.
[1]
68 changes: 68 additions & 0 deletions tests/functional/filter/cram/filter-query-example.t
@@ -0,0 +1,68 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Filter into two separate sets and then select sequences from the union of those sets.
First, select strains from Brazil (there should be 1).

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --query "country == 'Brazil'" \
> --output-strains "$TMP/filtered_strains.brazil.txt" > /dev/null
$ wc -l "$TMP/filtered_strains.brazil.txt"
\s*1 .* (re)

Then, select strains from Colombia (there should be 3).

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --query "country == 'Colombia'" \
> --output-strains "$TMP/filtered_strains.colombia.txt" > /dev/null
$ wc -l "$TMP/filtered_strains.colombia.txt"
\s*3 .* (re)

Finally, exclude all sequences except those from the two sets of strains (there should be 4).

$ ${AUGUR} filter \
> --sequences filter/data/sequences.fasta \
> --sequence-index filter/data/sequence_index.tsv \
> --metadata filter/data/metadata.tsv \
> --exclude-all \
> --include "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \
> --output "$TMP/filtered.fasta" > /dev/null
$ grep "^>" "$TMP/filtered.fasta" | wc -l
\s*4 (re)
$ rm -f "$TMP/filtered.fasta"

Repeat this filter without a sequence index.
We should get the same outputs without building a sequence index on the fly, because the exclude-all flag tells us we only want to force-include strains and skip all other filters.

$ ${AUGUR} filter \
> --sequences filter/data/sequences.fasta \
> --metadata filter/data/metadata.tsv \
> --exclude-all \
> --include "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \
> --output "$TMP/filtered.fasta" \
> --output-metadata "$TMP/filtered.tsv" > /dev/null
$ grep "^>" "$TMP/filtered.fasta" | wc -l
\s*4 (re)
$ rm -f "$TMP/filtered.fasta"

Metadata should have the same number of records as the sequences plus a header.

$ wc -l "$TMP/filtered.tsv"
\s*5 .* (re)
$ rm -f "$TMP/filtered.tsv"

Alternately, exclude the sequences from Brazil and Colombia (N=4) and records without sequences (N=1) or metadata (N=1).

$ ${AUGUR} filter \
> --sequences filter/data/sequences.fasta \
> --sequence-index filter/data/sequence_index.tsv \
> --metadata filter/data/metadata.tsv \
> --exclude "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \
> --output "$TMP/filtered.fasta" > /dev/null
$ grep "^>" "$TMP/filtered.fasta" | wc -l
\s*7 (re)
$ rm -f "$TMP/filtered.fasta"
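The force-include step above relies on the two strain lists forming a union of 1 + 3 = 4 names. A minimal sketch of that count with plain shell tools; the strain names are placeholders, not values from filter/data/metadata.tsv:

```shell
workdir="$(mktemp -d)"

# Placeholder strain lists standing in for the two --output-strains files.
printf 'BRA/placeholder_1\n' > "$workdir/brazil.txt"
printf 'COL/placeholder_1\nCOL/placeholder_2\nCOL/placeholder_3\n' > "$workdir/colombia.txt"

# sort -u merges both lists and drops any name that appears twice.
union_count="$(sort -u "$workdir/brazil.txt" "$workdir/colombia.txt" \
  | wc -l | tr -d '[:space:]')"
echo "$union_count"
rm -rf "$workdir"
```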
16 changes: 16 additions & 0 deletions tests/functional/filter/cram/filter-sequences-vcf.t
@@ -0,0 +1,16 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Filter TB strains from VCF and save as a list of filtered strains.

$ ${AUGUR} filter \
> --sequences filter/data/tb.vcf.gz \
> --metadata filter/data/tb_metadata.tsv \
> --min-date 2012 \
> --output-strains "$TMP/filtered_strains.txt" > /dev/null
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
$ wc -l "$TMP/filtered_strains.txt"
\s*3 .* (re)
$ rm -f "$TMP/filtered_strains.txt"