Commit
The current filter.t is pretty cumbersome to work with. See slack thread: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1653072787820709

This change breaks that file into several smaller files.

- Split filter.t into much smaller files
  - For most files, each is setup + one augur filter command.
  - A few files also check the output or run related commands.
- Re-organize supporting files so that everything is under tests/functional/filter/, which has two folders:
  - cram: individual test files
  - data: supporting data (previously directly under tests/functional/filter/)
Showing 36 changed files with 615 additions and 514 deletions.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
pushd "$TESTDIR/../../" > /dev/null | ||
export AUGUR="${AUGUR:-../../bin/augur}" |
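Note on the setup script: the `${AUGUR:-../../bin/augur}` form is standard shell parameter expansion with a default, so a caller can point the tests at a different executable by exporting `AUGUR` first. A minimal sketch of that behavior follows; the override path `/opt/custom/augur` is a hypothetical value used only for illustration.

```shell
# With AUGUR unset (or empty), the expansion falls back to the in-repo path.
unset AUGUR
default_augur="${AUGUR:-../../bin/augur}"
echo "$default_augur"

# With AUGUR set, the caller's value wins.
# NOTE: /opt/custom/augur is a hypothetical path, not one from this commit.
AUGUR="/opt/custom/augur"
override_augur="${AUGUR:-../../bin/augur}"
echo "$override_augur"
```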
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Filter with exclude query for two regions that comprise all but one strain. | ||
This filter should leave a single record from Oceania. | ||
Force include one South American record by country to get two total records. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --exclude-where "region=South America" "region=North America" "region=Southeast Asia" \ | ||
> --include-where "country=Ecuador" \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
$ wc -l "$TMP/filtered_strains.txt" | ||
\s*2 .* (re) | ||
$ rm -f "$TMP/filtered_strains.txt" |
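Note on the `\s*2 .* (re)` expectation: `wc -l FILE` prints the count followed by the file name, and some implementations pad the count with leading whitespace, which is presumably why the tests match with a regular expression instead of a literal line. The sketch below reproduces the same check outside cram on a throwaway file; the strain names are placeholders.

```shell
# Build a two-line strain list (placeholder names, not from the test data).
tmpfile="$(mktemp)"
printf 'strain_a\nstrain_b\n' > "$tmpfile"

# wc prints "<count> <filename>", possibly with leading padding.
wc_output="$(wc -l "$tmpfile")"

# Same idea as the cram pattern, written with a POSIX character class.
if printf '%s\n' "$wc_output" | grep -Eq '^[[:space:]]*2 .*$'; then
    matched=yes
else
    matched=no
fi

# Reading from stdin drops the file name; tr strips any padding.
line_count="$(wc -l < "$tmpfile" | tr -d '[:space:]')"
echo "$matched $line_count"

rm -f "$tmpfile"
```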
42 changes: 42 additions & 0 deletions
tests/functional/filter/cram/filter-metadata-duplicates-error.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Error on duplicates in metadata within same chunk. | ||
|
||
$ cat >$TMP/metadata-duplicates.tsv <<~~ | ||
> strain date | ||
> a 2010-10-10 | ||
> a 2010-10-10 | ||
> b 2010-10-10 | ||
> c 2010-10-10 | ||
> d 2010-10-10 | ||
> ~~ | ||
$ ${AUGUR} filter \ | ||
> --metadata $TMP/metadata-duplicates.tsv \ | ||
> --group-by year \ | ||
> --sequences-per-group 2 \ | ||
> --subsample-seed 0 \ | ||
> --metadata-chunk-size 10 \ | ||
> --output-metadata $TMP/metadata-filtered.tsv > /dev/null | ||
ERROR: Duplicate found in .* (re) | ||
[2] | ||
$ cat $TMP/metadata-filtered.tsv | ||
cat: .*: No such file or directory (re) | ||
[1] | ||
|
||
Error on duplicates in metadata in separate chunks. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata $TMP/metadata-duplicates.tsv \ | ||
> --group-by year \ | ||
> --sequences-per-group 2 \ | ||
> --subsample-seed 0 \ | ||
> --metadata-chunk-size 1 \ | ||
> --output-metadata $TMP/metadata-filtered.tsv > /dev/null | ||
ERROR: Duplicate found in .* (re) | ||
[2] | ||
$ cat $TMP/metadata-filtered.tsv | ||
cat: .*: No such file or directory (re) | ||
[1] |
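For reference, duplicate strain names like the one in this fixture can be spotted before running the filter with standard tools. This is only a quick pre-check sketch, not how augur detects duplicates, and it assumes the metadata is tab-delimited with the strain name in the first column.

```shell
# Recreate the fixture from the test (tab-delimited is an assumption here).
metadata="$(mktemp)"
printf 'strain\tdate\na\t2010-10-10\na\t2010-10-10\nb\t2010-10-10\nc\t2010-10-10\nd\t2010-10-10\n' > "$metadata"

# Skip the header, keep the first column, and print names seen more than once.
duplicates="$(tail -n +2 "$metadata" | cut -f 1 | sort | uniq -d)"
echo "$duplicates"

rm -f "$metadata"
```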
14 changes: 14 additions & 0 deletions
tests/functional/filter/cram/filter-metadata-not-found-error.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Try to filter on a metadata file that does not exist. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata file-does-not-exist.tsv \ | ||
> --group-by year month \ | ||
> --sequences-per-group 1 \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
ERROR: No such file or directory: 'file-does-not-exist.tsv' | ||
[2] |
29 changes: 29 additions & 0 deletions
tests/functional/filter/cram/filter-metadata-sequence-strains-mismatch.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Confirm that filtering omits strains without metadata or sequences. | ||
The input sequences are missing one strain that is in the metadata. | ||
The metadata are missing one strain that has a sequence. | ||
The list of strains to include has one strain with no metadata/sequence and one strain with information that would have been filtered by country. | ||
The query initially filters 3 strains from Colombia, one of which is added back by the include. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequence-index filter/data/sequence_index.tsv \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --query "country != 'Colombia'" \ | ||
> --non-nucleotide \ | ||
> --exclude-ambiguous-dates-by year \ | ||
> --include filter/data/include.txt \ | ||
> --output-strains "$TMP/filtered_strains.txt" \ | ||
> --output-log "$TMP/filtered_log.tsv" | ||
4 strains were dropped during filtering | ||
\t1 had no metadata (esc) | ||
\t1 had no sequence data (esc) | ||
\t3 of these were filtered out by the query: "country != 'Colombia'" (esc) | ||
\t1 strains were added back because they were in filter/data/include.txt (esc) | ||
9 strains passed all filters | ||
|
||
$ diff -u <(sort -k 1,1 filter/data/filtered_log.tsv) <(sort -k 1,1 "$TMP/filtered_log.tsv") | ||
$ rm -f "$TMP/filtered_strains.txt" |
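The "no metadata" and "no sequence data" counts in the report above are set differences between the strain names in the two inputs. A small sketch of that idea with `comm` follows; the strain names `s1` to `s4` are hypothetical and the lists are not the real test data.

```shell
workdir="$(mktemp -d)"

# Hypothetical strain lists; comm requires sorted input.
printf 's1\ns2\ns3\n' | sort > "$workdir/metadata_strains.txt"
printf 's2\ns3\ns4\n' | sort > "$workdir/sequence_strains.txt"

# Only in metadata: strains that have no sequence data.
no_sequence="$(comm -23 "$workdir/metadata_strains.txt" "$workdir/sequence_strains.txt")"
# Only in sequences: strains that have no metadata.
no_metadata="$(comm -13 "$workdir/metadata_strains.txt" "$workdir/sequence_strains.txt")"
# In both: candidates for the remaining filters.
shared_count="$(comm -12 "$workdir/metadata_strains.txt" "$workdir/sequence_strains.txt" | wc -l | tr -d '[:space:]')"

echo "no sequence: $no_sequence; no metadata: $no_metadata; shared: $shared_count"

rm -rf "$workdir"
```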
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Filter using only metadata without a sequence index. | ||
This should work because the requested filters don't rely on sequence information. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-date 2012 \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
$ rm -f "$TMP/filtered_strains.txt" |
14 changes: 14 additions & 0 deletions
tests/functional/filter/cram/filter-min-length-no-sequence-index-error.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Try to filter using only metadata without a sequence index. | ||
This should fail because the requested filters rely on sequence information. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-length 10000 \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
ERROR: You need to provide a sequence index or sequences to filter on sequence-specific information. | ||
[1] |
19 changes: 19 additions & 0 deletions
tests/functional/filter/cram/filter-min-length-output-metadata.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Filter using only metadata without sequence input or output and save results as filtered metadata. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequence-index filter/data/sequence_index.tsv \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-date 2012 \ | ||
> --min-length 10500 \ | ||
> --output-metadata "$TMP/filtered_metadata.tsv" > /dev/null | ||
|
||
Output should include the 8 sequences matching the filters and a header line. | ||
|
||
$ wc -l "$TMP/filtered_metadata.tsv" | ||
\s*9 .* (re) | ||
$ rm -f "$TMP/filtered_metadata.tsv" |
19 changes: 19 additions & 0 deletions
tests/functional/filter/cram/filter-min-length-output-strains.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Filter using only metadata and save results as a list of filtered strains. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequence-index filter/data/sequence_index.tsv \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-date 2012 \ | ||
> --min-length 10500 \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
|
||
Output should include only the 8 sequences matching the filters (without a header line). | ||
|
||
$ wc -l "$TMP/filtered_strains.txt" | ||
\s*8 .* (re) | ||
$ rm -f "$TMP/filtered_strains.txt" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Check output of min/max date filters. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-date 2015-01-01 \ | ||
> --max-date 2016-02-01 \ | ||
> --output-metadata "$TMP/filtered_metadata.tsv" | ||
8 strains were dropped during filtering | ||
\t1 of these were dropped because they were earlier than 2015.0 or missing a date (esc) | ||
\t7 of these were dropped because they were later than 2016.09 or missing a date (esc) | ||
4 strains passed all filters |
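The report above shows the date bounds as decimal years (2015.0 and 2016.09) instead of the ISO dates passed on the command line. The sketch below shows one conversion that reproduces those values when rounded to two decimals; the exact formula augur uses is not shown in this commit, so the mid-day offset and the `decimal_year` helper are assumptions for illustration only.

```shell
# decimal_year is a hypothetical helper: ISO date (YYYY-MM-DD) -> decimal year.
# ASSUMPTION: year + (day_of_year - 0.5) / days_in_year, rounded to 2 decimals.
decimal_year() {
    printf '%s\n' "$1" | awk -F- '
        {
            year = $1 + 0; month = $2 + 0; day = $3 + 0
            leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0)
            split("31 28 31 30 31 30 31 31 30 31 30 31", days_in_month, " ")
            if (leap) days_in_month[2] = 29
            # Add the lengths of all months before the given one.
            day_of_year = day
            for (m = 1; m < month; m++) day_of_year += days_in_month[m]
            days_in_year = leap ? 366 : 365
            printf "%.2f\n", year + (day_of_year - 0.5) / days_in_year
        }'
}

min_date="$(decimal_year 2015-01-01)"
max_date="$(decimal_year 2016-02-01)"
echo "$min_date $max_date"
```

The helper always prints two decimals, so the lower bound appears as 2015.00 here, whereas the augur message prints 2015.0.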
53 changes: 53 additions & 0 deletions
tests/functional/filter/cram/filter-mismatched-sequences-error.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Try to filter with sequences that don't match any of the metadata. | ||
This should produce no results because the intersection of metadata and sequences is empty. | ||
|
||
$ echo -e ">something\nATCG" > "$TMP/dummy.fasta" | ||
$ ${AUGUR} filter \ | ||
> --sequences "$TMP/dummy.fasta" \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-length 4 \ | ||
> --max-date 2020-01-30 \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`. | ||
ERROR: All samples have been dropped! Check filter rules and metadata file format. | ||
[1] | ||
$ wc -l "$TMP/filtered_strains.txt" | ||
\s*0 .* (re) | ||
$ rm -f "$TMP/filtered_strains.txt" | ||
|
||
Repeat with sequence and strain outputs. We should get the same results. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequences "$TMP/dummy.fasta" \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --max-date 2020-01-30 \ | ||
> --output-strains "$TMP/filtered_strains.txt" \ | ||
> --output-sequences "$TMP/filtered.fasta" > /dev/null | ||
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`. | ||
ERROR: All samples have been dropped! Check filter rules and metadata file format. | ||
[1] | ||
$ wc -l "$TMP/filtered_strains.txt" | ||
\s*0 .* (re) | ||
$ grep "^>" "$TMP/filtered.fasta" | wc -l | ||
\s*0 (re) | ||
$ rm -f "$TMP/filtered_strains.txt" | ||
$ rm -f "$TMP/filtered.fasta" | ||
|
||
Repeat without any sequence-based filters. | ||
Since we expect metadata to be filtered by presence of strains in input sequences, this should produce no results because the intersection of metadata and sequences is empty. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequences "$TMP/dummy.fasta" \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`. | ||
ERROR: All samples have been dropped! Check filter rules and metadata file format. | ||
[1] | ||
$ wc -l "$TMP/filtered_strains.txt" | ||
\s*0 .* (re) | ||
$ rm -f "$TMP/filtered_strains.txt" |
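Note on the `grep "^>" ... | wc -l` idiom used above: `grep` exits non-zero when nothing matches, which cram would report as an exit code line, while piping into `wc -l` leaves the pipeline with the exit status of `wc` (in a shell without `pipefail`). A minimal sketch of counting FASTA records this way, using the same dummy record as the test plus an empty file:

```shell
workdir="$(mktemp -d)"

# Same dummy record as in the test, written with printf for portability.
printf '>something\nATCG\n' > "$workdir/dummy.fasta"
: > "$workdir/empty.fasta"

# Count header lines; tr strips any padding that wc may add.
dummy_count="$(grep '^>' "$workdir/dummy.fasta" | wc -l | tr -d '[:space:]')"
empty_count="$(grep '^>' "$workdir/empty.fasta" | wc -l | tr -d '[:space:]')"

# Extract the sequence names by dropping the leading ">".
sequence_names="$(sed -n 's/^>//p' "$workdir/dummy.fasta")"

echo "$dummy_count $empty_count $sequence_names"

rm -rf "$workdir"
```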
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Try to filter without any outputs. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequence-index filter/data/sequence_index.tsv \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-length 10000 > /dev/null | ||
ERROR: You need to select at least one output. | ||
[1] |
14 changes: 14 additions & 0 deletions
tests/functional/filter/cram/filter-output-directory-not-found-error.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Try to output to a directory that does not exist. | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --group-by year month \ | ||
> --sequences-per-group 1 \ | ||
> --output-strains "directory-does-not-exist/filtered_strains.txt" > /dev/null | ||
ERROR: No such file or directory: 'directory-does-not-exist/filtered_strains.txt' | ||
[2] |
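As the test above shows, the output directory has to exist before the command runs. A caller can guard against this error by creating the parent directory first; the sketch below uses a hypothetical `results/` directory under a temporary location.

```shell
# Hypothetical output location under a temporary directory.
base_dir="$(mktemp -d)"
output_path="$base_dir/results/filtered_strains.txt"

# Create the parent directory (and any missing ancestors) before writing.
mkdir -p "$(dirname "$output_path")"

if [ -d "$(dirname "$output_path")" ]; then
    parent_ready=yes
else
    parent_ready=no
fi
echo "$parent_ready"

rm -rf "$base_dir"
```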
15 changes: 15 additions & 0 deletions
tests/functional/filter/cram/filter-output-strains-no-sequence-error.t
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Try to filter with sequence outputs and no sequence inputs. | ||
This should fail. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequence-index filter/data/sequence_index.tsv \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --min-length 10000 \ | ||
> --output "$TMP/filtered.fasta" > /dev/null | ||
ERROR: You need to provide sequences to output sequences. | ||
[1] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Filter into two separate sets and then select sequences from the union of those sets. | ||
First, select strains from Brazil (there should be 1). | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --query "country == 'Brazil'" \ | ||
> --output-strains "$TMP/filtered_strains.brazil.txt" > /dev/null | ||
$ wc -l "$TMP/filtered_strains.brazil.txt" | ||
\s*1 .* (re) | ||
|
||
Then, select strains from Colombia (there should be 3). | ||
|
||
$ ${AUGUR} filter \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --query "country == 'Colombia'" \ | ||
> --output-strains "$TMP/filtered_strains.colombia.txt" > /dev/null | ||
$ wc -l "$TMP/filtered_strains.colombia.txt" | ||
\s*3 .* (re) | ||
|
||
Finally, exclude all sequences except those from the two sets of strains (there should be 4). | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequences filter/data/sequences.fasta \ | ||
> --sequence-index filter/data/sequence_index.tsv \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --exclude-all \ | ||
> --include "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \ | ||
> --output "$TMP/filtered.fasta" > /dev/null | ||
$ grep "^>" "$TMP/filtered.fasta" | wc -l | ||
\s*4 (re) | ||
$ rm -f "$TMP/filtered.fasta" | ||
|
||
Repeat this filter without a sequence index. | ||
We should get the same outputs without building a sequence index on the fly, because the exclude-all flag tells us we only want to force-include strains and skip all other filters. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequences filter/data/sequences.fasta \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --exclude-all \ | ||
> --include "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \ | ||
> --output "$TMP/filtered.fasta" \ | ||
> --output-metadata "$TMP/filtered.tsv" > /dev/null | ||
$ grep "^>" "$TMP/filtered.fasta" | wc -l | ||
\s*4 (re) | ||
$ rm -f "$TMP/filtered.fasta" | ||
|
||
Metadata should have the same number of records as the sequences plus a header. | ||
|
||
$ wc -l "$TMP/filtered.tsv" | ||
\s*5 .* (re) | ||
$ rm -f "$TMP/filtered.tsv" | ||
|
||
Alternately, exclude the sequences from Brazil and Colombia (N=4) and records without sequences (N=1) or metadata (N=1). | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequences filter/data/sequences.fasta \ | ||
> --sequence-index filter/data/sequence_index.tsv \ | ||
> --metadata filter/data/metadata.tsv \ | ||
> --exclude "$TMP/filtered_strains.brazil.txt" "$TMP/filtered_strains.colombia.txt" \ | ||
> --output "$TMP/filtered.fasta" > /dev/null | ||
$ grep "^>" "$TMP/filtered.fasta" | wc -l | ||
\s*7 (re) | ||
$ rm -f "$TMP/filtered.fasta" |
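The expected count of 4 in the `--exclude-all` plus `--include` steps above is the size of the union of the two strain lists (1 from Brazil, 3 from Colombia). A quick way to sanity-check such a union outside augur is sketched below; the strain names are hypothetical stand-ins for the contents of the two list files.

```shell
workdir="$(mktemp -d)"

# Hypothetical strain names standing in for the two filtered lists.
printf 'BRA/2016/1\n' > "$workdir/brazil.txt"
printf 'COL/2016/1\nCOL/2016/2\nCOL/2016/3\n' > "$workdir/colombia.txt"

# sort -u removes any strain that appears in both lists.
union_count="$(cat "$workdir/brazil.txt" "$workdir/colombia.txt" | sort -u | wc -l | tr -d '[:space:]')"
echo "$union_count"

rm -rf "$workdir"
```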
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
Setup | ||
|
||
$ pushd "$TESTDIR" > /dev/null | ||
$ source _setup.sh | ||
|
||
Filter TB strains from VCF and save as a list of filtered strains. | ||
|
||
$ ${AUGUR} filter \ | ||
> --sequences filter/data/tb.vcf.gz \ | ||
> --metadata filter/data/tb_metadata.tsv \ | ||
> --min-date 2012 \ | ||
> --output-strains "$TMP/filtered_strains.txt" > /dev/null | ||
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`. | ||
$ wc -l "$TMP/filtered_strains.txt" | ||
\s*3 .* (re) | ||
$ rm -f "$TMP/filtered_strains.txt" |