Merge branch 'trs/merge/source-columns'
tsibley committed Sep 10, 2024
2 parents 1f8fa35 + 3c32b99 commit db54927
Showing 3 changed files with 123 additions and 4 deletions.
5 changes: 5 additions & 0 deletions CHANGES.md
@@ -2,12 +2,17 @@

## __NEXT__

### Features

* merge: Generated source columns (e.g. `__source_metadata_{NAME}`) may now have their name template changed with `--source-columns=TEMPLATE` or may be omitted entirely with `--no-source-columns`. [#1625][] (@tsibley)

### Bug Fixes

* filter: Previously, when `--subsample-max-sequences` was slightly lower than the number of groups, it was possible to fail with an uncaught `AssertionError`. Internal calculations have been adjusted to prevent this from happening. [#1588][] [#1598][] (@victorlin)

[#1588]: https://github.com/nextstrain/augur/issues/1588
[#1598]: https://github.com/nextstrain/augur/issues/1598
[#1625]: https://github.com/nextstrain/augur/issues/1625

## 25.4.0 (3 September 2024)

54 changes: 50 additions & 4 deletions augur/merge.py
@@ -20,7 +20,9 @@
table to identify the source of each row's data. Column names are generated
as "__source_metadata_{NAME}" where "{NAME}" is the table name given to
--metadata. Values in each column are 1 or 0 for present or absent in that
input table.
input table. You may change the generated column names by providing your own
template with --source-columns or omit these columns entirely with
--no-source-columns.
Metadata tables of arbitrary size can be handled, limited only by available
disk space. Tables are not required to be entirely loadable into memory. The
@@ -89,6 +91,8 @@ def register_parser(parent_subparsers):

output_group = parser.add_argument_group("outputs", "options related to output")
output_group.add_argument('--output-metadata', required=True, metavar="FILE", help="Required. Merged metadata as TSV. Compressed files are supported." + SKIP_AUTO_DEFAULT_IN_HELP)
output_group.add_argument('--source-columns', default="__source_metadata_{NAME}", metavar="TEMPLATE", help=f"Template with which to generate names for the columns (described above) identifying the source of each row's data. Must contain a literal placeholder, {{NAME}}, which stands in for the metadata table names assigned in --metadata.")
output_group.add_argument('--no-source-columns', dest="source_columns", action="store_const", const=None, help=f"Suppress generated columns (described above) identifying the source of each row's data." + SKIP_AUTO_DEFAULT_IN_HELP)
output_group.add_argument('--quiet', action="store_true", default=False, help="Suppress informational and warning messages normally written to stderr. (default: disabled)" + SKIP_AUTO_DEFAULT_IN_HELP)

return parser
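
The two new options share one destination: `--no-source-columns` writes `None` into the same `source_columns` attribute that `--source-columns` fills. A minimal standalone sketch of that argparse pattern follows; the `build_parser` helper and the `merge-sketch` program name are hypothetical and only the two option definitions mirror the diff.

```python
import argparse

DEFAULT_TEMPLATE = "__source_metadata_{NAME}"


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for augur's register_parser(); only the two
    # options relevant to this change are defined.
    parser = argparse.ArgumentParser(prog="merge-sketch")
    parser.add_argument("--source-columns", default=DEFAULT_TEMPLATE,
                        metavar="TEMPLATE")
    # Same dest as above: giving this flag stores None over the template.
    parser.add_argument("--no-source-columns", dest="source_columns",
                        action="store_const", const=None)
    return parser


if __name__ == "__main__":
    parser = build_parser()
    for argv in ([], ["--source-columns", "origin_{NAME}"], ["--no-source-columns"]):
        print(argv, "->", parser.parse_args(argv).source_columns)
```

Because both options write to one attribute, downstream code only has to check whether `args.source_columns` is `None`, and when both flags are given the one appearing last on the command line wins.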
@@ -149,6 +153,22 @@ def run(args):
"""))


    # Validate --source-columns template and convert to template function
    output_source_column = None

    if args.source_columns is not None:
        if "{NAME}" not in args.source_columns:
            raise AugurError(dedent(f"""\
                The --source-columns template must contain the literal
                placeholder {{NAME}} but the given value ({args.source_columns!r}) does not.

                You may need to quote the whole template value to prevent your
                shell from interpreting the placeholder before Augur sees it.
                """))

        output_source_column = lambda name: args.source_columns.replace('{NAME}', name)
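
The validation step above can be sketched in isolation as a small factory that returns either `None` or a naming function. The name `make_source_column_namer` and the use of `ValueError` in place of `AugurError` are assumptions made so the sketch runs without Augur installed.

```python
from typing import Callable, Optional

PLACEHOLDER = "{NAME}"


def make_source_column_namer(template: Optional[str]) -> Optional[Callable[[str], str]]:
    """Return a function mapping a table name to a column name, or None.

    Hypothetical helper; ValueError stands in for AugurError.
    """
    if template is None:
        # Corresponds to --no-source-columns: no columns are generated.
        return None
    if PLACEHOLDER not in template:
        raise ValueError(
            f"template must contain the literal placeholder {PLACEHOLDER} "
            f"but the given value ({template!r}) does not")
    # Plain substring replacement: other braces in the template are left alone.
    return lambda name: template.replace(PLACEHOLDER, name)


if __name__ == "__main__":
    namer = make_source_column_namer("origin_{NAME}")
    print(namer("X"))
```

One consequence of using `str.replace` rather than `str.format` is that a template containing other brace sequences, such as `{NAME}_{0}`, is accepted as literal text instead of failing on an unknown format field. Whether that was the motivation is not stated in the commit.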


# Infer delimiters and id columns
metadata = [
NamedMetadata(name, path, [delim for name_, delim in metadata_delimiters if not name_ or name_ == name] or DEFAULT_DELIMITERS,
@@ -195,6 +215,30 @@ def run(args):
Renaming may be done with `augur curate rename`.
"""))

output_source_columns = set(
output_source_column(m.name)
for m in metadata
if output_source_column)

if conflicting_columns := [f"{c!r} in metadata table {m.name!r}"
for m in metadata
for c in m.columns
if c in output_source_columns]:
raise AugurError(dedent(f"""\
Generated source column names may not conflict with any column
names in metadata inputs.
The given source column template ({args.source_columns!r}) with the
given metadata table names would conflict with the following input
{_n("column", "columns", len(conflicting_columns))}:
{indented_list(conflicting_columns, ' ' + ' ')}
Please adjust the source column template with --source-columns
and/or adjust the metadata table names to avoid conflicts.
"""))
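
The conflict check above compares the set of generated names against every input column. A reduced sketch, assuming tables are represented as a plain dict of name to column list (the `find_conflicts` helper and the column lists are hypothetical, chosen to mirror the `'{NAME}'` test case in this commit):

```python
from typing import Dict, List


def find_conflicts(template: str, tables: Dict[str, List[str]]) -> List[str]:
    """List input columns that collide with generated source column names."""
    generated = {template.replace("{NAME}", name) for name in tables}
    return [
        f"{column!r} in metadata table {name!r}"
        for name, columns in tables.items()
        for column in columns
        if column in generated
    ]


if __name__ == "__main__":
    # Hypothetical column lists; the real x.tsv and y.tsv may differ.
    tables = {
        "a": ["strain", "a", "b", "c"],
        "b": ["strain", "b", "c", "f", "e", "d"],
    }
    for line in find_conflicts("{NAME}", tables):
        print(line)
```

Note that a generated name conflicts with a matching column in any input table, not only the table it was generated for, which is why `'b' in metadata table 'a'` is reported.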


try:
# Read all metadata files into a SQLite db
for m in metadata:
@@ -245,9 +289,11 @@ def run(args):
*(f"""coalesce({', '.join(f"nullif({x}, '')" for x in starmap(sqlite_quote_id, reversed(input_columns)))}, null) as {sqlite_quote_id(output_column)}"""
for output_column, input_columns in output_columns.items()),

# Source columns
*(f"""{sqlite_quote_id(m.table_name, m.id_column)} is not null as {sqlite_quote_id(f'__source_metadata_{m.name}')}"""
for m in metadata)]
# Source columns. Select expressions generated here instead of
# earlier to stay adjacent to the join conditions below, upon which
# these rely.
*(f"""{sqlite_quote_id(m.table_name, m.id_column)} is not null as {sqlite_quote_id(output_source_column(m.name))}"""
for m in metadata if output_source_column)]
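
To see why the source column expressions depend on the join, here is a self-contained sketch using the standard library `sqlite3` module. The table layout, the `quote_id` helper, and the union-of-ids join are simplifying assumptions and not Augur's actual schema or query; only the `<id column> is not null as <name>` expression is taken from the diff.

```python
import sqlite3
from typing import List, Optional, Tuple


def quote_id(*parts: str) -> str:
    # Hypothetical stand-in for augur's sqlite_quote_id().
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def source_flags(template: Optional[str]) -> Tuple[List[str], List[tuple]]:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table x (strain text primary key, a text);
        create table y (strain text primary key, f text);
        insert into x values ('one', 'X1a'), ('two', 'X2a');
        insert into y values ('two', 'Y2f'), ('three', 'Y3f');
    """)
    select_list = [quote_id("ids", "strain")]
    if template is not None:
        # A left-joined table's id column is NULL for rows absent from that
        # table, so "is not null" evaluates to 1 (present) or 0 (absent).
        select_list += [
            f"{quote_id(t, 'strain')} is not null as "
            f"{quote_id(template.replace('{NAME}', t.upper()))}"
            for t in ("x", "y")]
    query = f"""
        select {', '.join(select_list)}
        from (select strain from x union select strain from y) as ids
        left join x on x.strain = ids.strain
        left join y on y.strain = ids.strain
        order by ids.strain
    """
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    conn.close()
    return columns, rows


if __name__ == "__main__":
    print(source_flags("origin_{NAME}"))
    print(source_flags(None))
```

Passing `None` as the template drops the generated expressions from the select list entirely, which corresponds to the `if output_source_column` filter in the hunk above.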

from_list = [
sqlite_quote_id(metadata[0].table_name),
68 changes: 68 additions & 0 deletions tests/functional/merge/cram/merge.t
@@ -184,6 +184,28 @@ Metadata field values with metachars (field or record delimiters) are handled pr
x" 1 1
two X2a X2b X2c 1 1

Source columns template.

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --source-columns 'origin_{NAME}' \
> --output-metadata - --quiet | csv2tsv --csv-delim $'\t' | tsv-pretty
strain  a    b    c    f    e    d    origin_X  origin_Y
one     X1a  X1b  X1c                        1         0
two     X2a  X2b  Y2c  Y2f  Y2e  Y2d         1         1
three                  Y3f  Y3e  Y3d         0         1

No source columns.

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --no-source-columns \
> --output-metadata - --quiet | csv2tsv --csv-delim $'\t' | tsv-pretty
strain  a    b    c    f    e    d
one     X1a  X1b  X1c
two     X2a  X2b  Y2c  Y2f  Y2e  Y2d
three                  Y3f  Y3e  Y3d


ERROR HANDLING

Expand Down Expand Up @@ -325,6 +347,52 @@ Non-id column names conflicting with output id column name.

[2]

Invalid source columns template.

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --source-columns 'nope' \
> --output-metadata /dev/null --quiet
ERROR: The --source-columns template must contain the literal
placeholder {NAME} but the given value ('nope') does not.

You may need to quote the whole template value to prevent your
shell from interpreting the placeholder before Augur sees it.

[2]

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --source-columns '' \
> --output-metadata /dev/null --quiet
ERROR: The --source-columns template must contain the literal
placeholder {NAME} but the given value ('') does not.

You may need to quote the whole template value to prevent your
shell from interpreting the placeholder before Augur sees it.

[2]

$ ${AUGUR} merge \
> --metadata a=x.tsv b=y.tsv \
> --source-columns '{NAME}' \
> --output-metadata /dev/null --quiet
ERROR: Generated source column names may not conflict with any column
names in metadata inputs.

The given source column template ('{NAME}') with the
given metadata table names would conflict with the following input
columns:

'a' in metadata table 'a'
'b' in metadata table 'a'
'b' in metadata table 'b'

Please adjust the source column template with --source-columns
and/or adjust the metadata table names to avoid conflicts.

[2]

SQLITE3 env var can be used to override `sqlite3` location (and failure is
handled).
