Reorganize Cram test files

The current filter.t is pretty cumbersome to work with. See slack thread: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1653072787820709

This change breaks that file into several smaller files.

- Split filter.t into much smaller files
    - For most files, each is setup + one augur filter command.
    - A few files also check the output or run related commands.
- Re-organize supporting files so that everything is under tests/functional/filter/, which has two folders:
    - cram: individual test files
    - data: supporting data (previously directly under tests/functional/filter/)
nextstrain · May 24, 2022 · 5ce415f
1 parent 99c4d05
commit 5ce415f
Showing 36 changed files with 615 additions and 514 deletions.
diff --git a/tests/functional/filter.t b/tests/functional/filter.t
diff --git a/tests/functional/filter/cram/_setup.sh b/tests/functional/filter/cram/_setup.sh
@@ -0,0 +1,2 @@
+pushd "$TESTDIR/../../" > /dev/null
+export AUGUR="${AUGUR:-../../bin/augur}"
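The `export AUGUR=...` line in `_setup.sh` relies on the shell's `${VAR:-default}` expansion, so a caller can point the tests at a different augur executable. A minimal sketch of that behavior follows; the override path `/opt/custom/bin/augur` is a made-up placeholder, and nothing here actually runs augur.

```shell
# Sketch of the ${VAR:-default} fallback used in _setup.sh.
# Both paths are only strings here; no executable is invoked.

# Case 1: AUGUR is not set, so the repository-relative default is used.
unset AUGUR
export AUGUR="${AUGUR:-../../bin/augur}"
default_augur="$AUGUR"

# Case 2: the caller already set AUGUR (hypothetical path), so it is kept.
AUGUR="/opt/custom/bin/augur"
export AUGUR="${AUGUR:-../../bin/augur}"
override_augur="$AUGUR"

echo "default:  $default_augur"
echo "override: $override_augur"
```

Because `_setup.sh` first changes into `$TESTDIR/../../`, the default `../../bin/augur` and the `filter/data/...` paths in the tests are resolved relative to `tests/functional/`.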
diff --git a/tests/functional/filter/cram/filter-exclude-include.t b/tests/functional/filter/cram/filter-exclude-include.t
@@ -0,0 +1,17 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Filter with an exclude query for three regions that comprise all but one strain.
+This filter should leave a single record from Oceania.
+Force include one South American record by country to get two total records.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/data/metadata.tsv \
+  >  --exclude-where "region=South America" "region=North America" "region=Southeast Asia" \
+  >  --include-where "country=Ecuador" \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+  $ wc -l "$TMP/filtered_strains.txt"
+  \s*2 .* (re)
+  $ rm -f "$TMP/filtered_strains.txt"
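The `\s*2 .* (re)` pattern presumably exists because some `wc` implementations (BSD/macOS) pad the count with leading spaces while GNU `wc` does not. As an alternative sketch, not what this commit does, the count can be normalized in the shell so that a literal match would work. The strain names are illustrative.

```shell
# Create a two-line strain list in a scratch directory (illustrative names).
workdir="$(mktemp -d)"
printf 'strain_a\nstrain_b\n' > "$workdir/filtered_strains.txt"

# Reading from stdin drops the filename from wc's output; tr strips any
# padding or trailing newline, leaving just the number.
line_count="$(wc -l < "$workdir/filtered_strains.txt" | tr -d '[:space:]')"
echo "$line_count"

rm -rf "$workdir"
```

The regex form used in the tests keeps each check to a single command, which is arguably easier to read in a Cram file.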
diff --git a/tests/functional/filter/cram/filter-metadata-duplicates-error.t b/tests/functional/filter/cram/filter-metadata-duplicates-error.t
@@ -0,0 +1,42 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Error on duplicates in metadata within same chunk.
+
+  $ cat >$TMP/metadata-duplicates.tsv <<~~
+  > strain	date
+  > a	2010-10-10
+  > a	2010-10-10
+  > b	2010-10-10
+  > c	2010-10-10
+  > d	2010-10-10
+  > ~~
+  $ ${AUGUR} filter \
+  >   --metadata $TMP/metadata-duplicates.tsv \
+  >   --group-by year \
+  >   --sequences-per-group 2 \
+  >   --subsample-seed 0 \
+  >   --metadata-chunk-size 10 \
+  >   --output-metadata $TMP/metadata-filtered.tsv > /dev/null
+  ERROR: Duplicate found in .* (re)
+  [2]
+  $ cat $TMP/metadata-filtered.tsv
+  cat: .*: No such file or directory (re)
+  [1]
+
+Error on duplicates in metadata in separate chunks.
+
+  $ ${AUGUR} filter \
+  >   --metadata $TMP/metadata-duplicates.tsv \
+  >   --group-by year \
+  >   --sequences-per-group 2 \
+  >   --subsample-seed 0 \
+  >   --metadata-chunk-size 1 \
+  >   --output-metadata $TMP/metadata-filtered.tsv > /dev/null
+  ERROR: Duplicate found in .* (re)
+  [2]
+  $ cat $TMP/metadata-filtered.tsv
+  cat: .*: No such file or directory (re)
+  [1]
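To see which strain triggers the duplicate error, a fixture like the one above can be inspected with standard tools. A rough sketch, assuming the strain name is in the first tab-separated column as in that fixture; this is independent of how augur itself detects duplicates.

```shell
# Rebuild a small metadata file with one repeated strain (subset of the fixture).
workdir="$(mktemp -d)"
printf 'strain\tdate\na\t2010-10-10\na\t2010-10-10\nb\t2010-10-10\n' \
  > "$workdir/metadata-duplicates.tsv"

# Skip the header, keep the strain column, and list names seen more than once.
duplicates="$(tail -n +2 "$workdir/metadata-duplicates.tsv" | cut -f 1 | sort | uniq -d)"
echo "$duplicates"

rm -rf "$workdir"
```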
diff --git a/tests/functional/filter/cram/filter-metadata-not-found-error.t b/tests/functional/filter/cram/filter-metadata-not-found-error.t
@@ -0,0 +1,14 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Try to filter on a metadata file that does not exist.
+
+  $ ${AUGUR} filter \
+  >  --metadata file-does-not-exist.tsv \
+  >  --group-by year month \
+  >  --sequences-per-group 1 \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+  ERROR: No such file or directory: 'file-does-not-exist.tsv'
+  [2]
diff --git a/tests/functional/filter/cram/filter-metadata-sequence-strains-mismatch.t b/tests/functional/filter/cram/filter-metadata-sequence-strains-mismatch.t
@@ -0,0 +1,29 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Confirm that filtering omits strains without metadata or sequences.
+The input sequences are missing one strain that is in the metadata.
+The metadata are missing one strain that has a sequence.
+The list of strains to include has one strain with no metadata/sequence and one strain with information that would have been filtered by country.
+The query initially filters 3 strains from Colombia, one of which is added back by the include.
+
+  $ ${AUGUR} filter \
+  >  --sequence-index filter/data/sequence_index.tsv \
+  >  --metadata filter/data/metadata.tsv \
+  >  --query "country != 'Colombia'" \
+  >  --non-nucleotide \
+  >  --exclude-ambiguous-dates-by year \
+  >  --include filter/data/include.txt \
+  >  --output-strains "$TMP/filtered_strains.txt" \
+  >  --output-log "$TMP/filtered_log.tsv"
+  4 strains were dropped during filtering
+  \t1 had no metadata (esc)
+  \t1 had no sequence data (esc)
+  \t3 of these were filtered out by the query: "country != 'Colombia'" (esc)
+  \t1 strains were added back because they were in filter/data/include.txt (esc)
+  9 strains passed all filters
+
+  $ diff -u <(sort -k 1,1 filter/data/filtered_log.tsv) <(sort -k 1,1 "$TMP/filtered_log.tsv")
+  $ rm -f "$TMP/filtered_strains.txt"
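The `diff -u` step compares the logs after sorting on the first column, so the check does not depend on the order in which log rows are written. A portable sketch of the same idea using temporary files instead of process substitution; the log rows and the `reason_x`/`reason_y` values are invented for illustration.

```shell
# Two logs with the same rows in a different order (made-up contents).
workdir="$(mktemp -d)"
printf 'strain_b\treason_y\nstrain_a\treason_x\n' > "$workdir/expected.tsv"
printf 'strain_a\treason_x\nstrain_b\treason_y\n' > "$workdir/actual.tsv"

# Sort both files on the first column before comparing.
sort -k 1,1 "$workdir/expected.tsv" > "$workdir/expected.sorted"
sort -k 1,1 "$workdir/actual.tsv" > "$workdir/actual.sorted"

# diff exits 0 only when the sorted files are identical.
if diff -u "$workdir/expected.sorted" "$workdir/actual.sorted" > /dev/null; then
  logs_match=yes
else
  logs_match=no
fi
echo "$logs_match"

rm -rf "$workdir"
```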
diff --git a/tests/functional/filter/cram/filter-min-date.t b/tests/functional/filter/cram/filter-min-date.t
@@ -0,0 +1,13 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Filter using only metadata without a sequence index.
+This should work because the requested filters don't rely on sequence information.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-date 2012 \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+  $ rm -f "$TMP/filtered_strains.txt"
diff --git a/tests/functional/filter/cram/filter-min-length-no-sequence-index-error.t b/tests/functional/filter/cram/filter-min-length-no-sequence-index-error.t
@@ -0,0 +1,14 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Try to filter using only metadata without a sequence index.
+This should fail because the requested filters rely on sequence information.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-length 10000 \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+  ERROR: You need to provide a sequence index or sequences to filter on sequence-specific information.
+  [1]
diff --git a/tests/functional/filter/cram/filter-min-length-output-metadata.t b/tests/functional/filter/cram/filter-min-length-output-metadata.t
@@ -0,0 +1,19 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Filter using metadata and a sequence index, without sequence input or output, and save results as filtered metadata.
+
+  $ ${AUGUR} filter \
+  >  --sequence-index filter/data/sequence_index.tsv \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-date 2012 \
+  >  --min-length 10500 \
+  >  --output-metadata "$TMP/filtered_metadata.tsv" > /dev/null
+
+Output should include the 8 sequences matching the filters and a header line.
+
+  $ wc -l "$TMP/filtered_metadata.tsv"
+  \s*9 .* (re)
+  $ rm -f "$TMP/filtered_metadata.tsv"
diff --git a/tests/functional/filter/cram/filter-min-length-output-strains.t b/tests/functional/filter/cram/filter-min-length-output-strains.t
@@ -0,0 +1,19 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Filter using metadata and a sequence index, and save results as a list of filtered strains.
+
+  $ ${AUGUR} filter \
+  >  --sequence-index filter/data/sequence_index.tsv \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-date 2012 \
+  >  --min-length 10500 \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+
+Output should include only the 8 sequences matching the filters (without a header line).
+
+  $ wc -l "$TMP/filtered_strains.txt"
+  \s*8 .* (re)
+  $ rm -f "$TMP/filtered_strains.txt"
diff --git a/tests/functional/filter/cram/filter-min-max-date-output.t b/tests/functional/filter/cram/filter-min-max-date-output.t
@@ -0,0 +1,16 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Check output of min/max date filters.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-date 2015-01-01 \
+  >  --max-date 2016-02-01 \
+  >  --output-metadata "$TMP/filtered_metadata.tsv"
+  8 strains were dropped during filtering
+  \t1 of these were dropped because they were earlier than 2015.0 or missing a date (esc)
+  \t7 of these were dropped because they were later than 2016.09 or missing a date (esc)
+  4 strains passed all filters
diff --git a/tests/functional/filter/cram/filter-mismatched-sequences-error.t b/tests/functional/filter/cram/filter-mismatched-sequences-error.t
@@ -0,0 +1,53 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Try to filter with sequences that don't match any of the metadata.
+This should produce no results because the intersection of metadata and sequences is empty.
+
+  $ echo -e ">something\nATCG" > "$TMP/dummy.fasta"
+  $ ${AUGUR} filter \
+  >  --sequences "$TMP/dummy.fasta" \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-length 4 \
+  >  --max-date 2020-01-30 \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+  Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
+  ERROR: All samples have been dropped! Check filter rules and metadata file format.
+  [1]
+  $ wc -l "$TMP/filtered_strains.txt"
+  \s*0 .* (re)
+  $ rm -f "$TMP/filtered_strains.txt"
+
+Repeat with sequence and strain outputs. We should get the same results.
+
+  $ ${AUGUR} filter \
+  >  --sequences "$TMP/dummy.fasta" \
+  >  --metadata filter/data/metadata.tsv \
+  >  --max-date 2020-01-30 \
+  >  --output-strains "$TMP/filtered_strains.txt" \
+  >  --output-sequences "$TMP/filtered.fasta" > /dev/null
+  Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
+  ERROR: All samples have been dropped! Check filter rules and metadata file format.
+  [1]
+  $ wc -l "$TMP/filtered_strains.txt"
+  \s*0 .* (re)
+  $ grep "^>" "$TMP/filtered.fasta" | wc -l
+  \s*0 (re)
+  $ rm -f "$TMP/filtered_strains.txt"
+  $ rm -f "$TMP/filtered.fasta"
+
+Repeat without any sequence-based filters.
+Since we expect metadata to be filtered by presence of strains in input sequences, this should produce no results because the intersection of metadata and sequences is empty.
+
+  $ ${AUGUR} filter \
+  >  --sequences "$TMP/dummy.fasta" \
+  >  --metadata filter/data/metadata.tsv \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+  Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
+  ERROR: All samples have been dropped! Check filter rules and metadata file format.
+  [1]
+  $ wc -l "$TMP/filtered_strains.txt"
+  \s*0 .* (re)
+  $ rm -f "$TMP/filtered_strains.txt"
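All three cases in this file fail for the same stated reason: no sequence name in `dummy.fasta` appears in the metadata. A rough sketch of checking that overlap by hand with `comm`; the metadata rows here are placeholders rather than the contents of `filter/data/metadata.tsv`, and augur's own matching logic may differ.

```shell
workdir="$(mktemp -d)"

# Same dummy sequence as in the test, plus placeholder metadata rows.
printf '>something\nATCG\n' > "$workdir/dummy.fasta"
printf 'strain\tdate\nstrain_a\t2016-01-01\nstrain_b\t2016-02-01\n' > "$workdir/metadata.tsv"

# Sequence names: FASTA header lines without the leading '>'.
grep '^>' "$workdir/dummy.fasta" | sed 's/^>//' | sort > "$workdir/sequence_names.txt"
# Metadata names: first column without the header row.
tail -n +2 "$workdir/metadata.tsv" | cut -f 1 | sort > "$workdir/metadata_names.txt"

# comm -12 prints only lines common to both sorted inputs.
shared_count="$(comm -12 "$workdir/sequence_names.txt" "$workdir/metadata_names.txt" \
  | wc -l | tr -d '[:space:]')"
echo "$shared_count"

rm -rf "$workdir"
```

A count of zero here corresponds to the empty intersection that the test descriptions refer to.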
diff --git a/tests/functional/filter/cram/filter-no-outputs-error.t b/tests/functional/filter/cram/filter-no-outputs-error.t
@@ -0,0 +1,13 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Try to filter without any outputs.
+
+  $ ${AUGUR} filter \
+  >  --sequence-index filter/data/sequence_index.tsv \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-length 10000 > /dev/null
+  ERROR: You need to select at least one output.
+  [1]
diff --git a/tests/functional/filter/cram/filter-output-directory-not-found-error.t b/tests/functional/filter/cram/filter-output-directory-not-found-error.t
@@ -0,0 +1,14 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Try to output to a directory that does not exist.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/data/metadata.tsv \
+  >  --group-by year month \
+  >  --sequences-per-group 1 \
+  >  --output-strains "directory-does-not-exist/filtered_strains.txt" > /dev/null
+  ERROR: No such file or directory: 'directory-does-not-exist/filtered_strains.txt'
+  [2]
diff --git a/tests/functional/filter/cram/filter-output-strains-no-sequence-error.t b/tests/functional/filter/cram/filter-output-strains-no-sequence-error.t
@@ -0,0 +1,15 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Try to filter with sequence outputs and no sequence inputs.
+This should fail.
+
+  $ ${AUGUR} filter \
+  >  --sequence-index filter/data/sequence_index.tsv \
+  >  --metadata filter/data/metadata.tsv \
+  >  --min-length 10000 \
+  >  --output "$TMP/filtered.fasta" > /dev/null
+  ERROR: You need to provide sequences to output sequences.
+  [1]
diff --git a/tests/functional/filter/cram/filter-query-example.t b/tests/functional/filter/cram/filter-query-example.t
@@ -0,0 +1,68 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Filter into two separate sets and then select sequences from the union of those sets.
+First, select strains from Brazil (there should be 1).
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/data/metadata.tsv \
+  >  --query "country == 'Brazil'" \
+  >  --output-strains "$TMP/filtered_strains.brazil.txt" > /dev/null
+  $ wc -l "$TMP/filtered_strains.brazil.txt"
+  \s*1 .* (re)
+
+Then, select strains from Colombia (there should be 3).
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/data/metadata.tsv \
+  >  --query "country == 'Colombia'" \
+  >  --output-strains "$TMP/filtered_strains.colombia.txt" > /dev/null
+  $ wc -l "$TMP/filtered_strains.colombia.txt"
+  \s*3 .* (re)
+
+Finally, exclude all sequences except those from the two sets of strains (there should be 4).
+
+  $ ${AUGUR} filter \
+  >  --sequences filter/data/sequences.fasta \
+  >  --sequence-index filter/data/sequence_index.tsv \
+  >  --metadata filter/data/metadata.tsv \
+  >  --exclude-all \
+  >  --include "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \
+  >  --output "$TMP/filtered.fasta" > /dev/null
+  $ grep "^>" "$TMP/filtered.fasta" | wc -l
+  \s*4 (re)
+  $ rm -f "$TMP/filtered.fasta"
+
+Repeat this filter without a sequence index.
+We should get the same outputs without building a sequence index on the fly, because the exclude-all flag tells us we only want to force-include strains and skip all other filters.
+
+  $ ${AUGUR} filter \
+  >  --sequences filter/data/sequences.fasta \
+  >  --metadata filter/data/metadata.tsv \
+  >  --exclude-all \
+  >  --include "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \
+  >  --output "$TMP/filtered.fasta" \
+  >  --output-metadata "$TMP/filtered.tsv" > /dev/null
+  $ grep "^>" "$TMP/filtered.fasta" | wc -l
+  \s*4 (re)
+  $ rm -f "$TMP/filtered.fasta"
+
+Metadata should have the same number of records as the sequences plus a header.
+
+  $ wc -l "$TMP/filtered.tsv"
+  \s*5 .* (re)
+  $ rm -f "$TMP/filtered.tsv"
+
+Alternately, exclude the sequences from Brazil and Colombia (N=4) and records without sequences (N=1) or metadata (N=1).
+
+  $ ${AUGUR} filter \
+  >  --sequences filter/data/sequences.fasta \
+  >  --sequence-index filter/data/sequence_index.tsv \
+  >  --metadata filter/data/metadata.tsv \
+  >  --exclude "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \
+  >  --output "$TMP/filtered.fasta" > /dev/null
+  $ grep "^>" "$TMP/filtered.fasta" | wc -l
+  \s*7 (re)
+  $ rm -f "$TMP/filtered.fasta"
diff --git a/tests/functional/filter/cram/filter-sequences-vcf.t b/tests/functional/filter/cram/filter-sequences-vcf.t
@@ -0,0 +1,16 @@
+Setup
+
+  $ pushd "$TESTDIR" > /dev/null
+  $ source _setup.sh
+
+Filter TB strains from VCF and save as a list of filtered strains.
+
+  $ ${AUGUR} filter \
+  >  --sequences filter/data/tb.vcf.gz \
+  >  --metadata filter/data/tb_metadata.tsv \
+  >  --min-date 2012 \
+  >  --output-strains "$TMP/filtered_strains.txt" > /dev/null
+  Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
+  $ wc -l "$TMP/filtered_strains.txt"
+  \s*3 .* (re)
+  $ rm -f "$TMP/filtered_strains.txt"