merge: Support changing the names of, or omitting entirely, the generated source columns

This lets us more easily use `augur merge` in places where it makes no
sense to include the generated source columns (e.g. in the Nextclade
metadata merge step of our workflows) and in places where we have
existing source column names we want to match (e.g. in ncov, replacing
the bespoke combine_metadata.py).
tsibley committed Sep 6, 2024
1 parent 81db604 commit 3c32b99
Showing 3 changed files with 123 additions and 4 deletions.
5 changes: 5 additions & 0 deletions CHANGES.md
@@ -2,12 +2,17 @@

## __NEXT__

### Features

* merge: Generated source columns (e.g. `__source_metadata_{NAME}`) may now have their name template changed with `--source-columns=TEMPLATE` or may be omitted entirely with `--no-source-columns`. [#1625][] (@tsibley)

### Bug Fixes

* filter: Previously, when `--subsample-max-sequences` was slightly lower than the number of groups, it was possible to fail with an uncaught `AssertionError`. Internal calculations have been adjusted to prevent this from happening. [#1588][] [#1598][] (@victorlin)

[#1588]: https://github.com/nextstrain/augur/issues/1588
[#1598]: https://github.com/nextstrain/augur/issues/1598
[#1625]: https://github.com/nextstrain/augur/issues/1625

## 25.4.0 (3 September 2024)

54 changes: 50 additions & 4 deletions augur/merge.py
@@ -20,7 +20,9 @@
table to identify the source of each row's data. Column names are generated
as "__source_metadata_{NAME}" where "{NAME}" is the table name given to
--metadata. Values in each column are 1 or 0 for present or absent in that
input table.
input table. You may change the generated column names by providing your own
template with --source-columns or omit these columns entirely with
--no-source-columns.
Metadata tables of arbitrary size can be handled, limited only by available
disk space. Tables are not required to be entirely loadable into memory. The
@@ -89,6 +91,8 @@ def register_parser(parent_subparsers):

output_group = parser.add_argument_group("outputs", "options related to output")
output_group.add_argument('--output-metadata', required=True, metavar="FILE", help="Required. Merged metadata as TSV. Compressed files are supported." + SKIP_AUTO_DEFAULT_IN_HELP)
output_group.add_argument('--source-columns', default="__source_metadata_{NAME}", metavar="TEMPLATE", help=f"Template with which to generate names for the columns (described above) identifying the source of each row's data. Must contain a literal placeholder, {{NAME}}, which stands in for the metadata table names assigned in --metadata.")
output_group.add_argument('--no-source-columns', dest="source_columns", action="store_const", const=None, help=f"Suppress generated columns (described above) identifying the source of each row's data." + SKIP_AUTO_DEFAULT_IN_HELP)
output_group.add_argument('--quiet', action="store_true", default=False, help="Suppress informational and warning messages normally written to stderr. (default: disabled)" + SKIP_AUTO_DEFAULT_IN_HELP)

return parser
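
The two options above write to the same destination, so whichever appears last on the command line wins, and the default template applies only when neither is given. A minimal standalone sketch of that argparse pattern follows; the parser name `merge-sketch` is made up for illustration, and Augur's help text and other options are omitted.

```python
import argparse

# Hypothetical stand-in parser; only the two source-column options are modeled.
parser = argparse.ArgumentParser(prog="merge-sketch")
parser.add_argument("--source-columns",
                    default="__source_metadata_{NAME}", metavar="TEMPLATE")
# Same dest as --source-columns: storing the constant None signals "no source columns".
parser.add_argument("--no-source-columns",
                    dest="source_columns", action="store_const", const=None)

default = parser.parse_args([]).source_columns
custom = parser.parse_args(["--source-columns", "origin_{NAME}"]).source_columns
omitted = parser.parse_args(["--no-source-columns"]).source_columns

print(default, custom, omitted)
```

Sharing one `dest` means downstream code only has to check a single attribute for `None`.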
@@ -149,6 +153,22 @@ def run(args):
"""))


    # Validate --source-columns template and convert to template function
    output_source_column = None

    if args.source_columns is not None:
        if "{NAME}" not in args.source_columns:
            raise AugurError(dedent(f"""\
                The --source-columns template must contain the literal
                placeholder {{NAME}} but the given value ({args.source_columns!r}) does not.
                You may need to quote the whole template value to prevent your
                shell from interpreting the placeholder before Augur sees it.
                """))

        output_source_column = lambda name: args.source_columns.replace('{NAME}', name)


# Infer delimiters and id columns
metadata = [
NamedMetadata(name, path, [delim for name_, delim in metadata_delimiters if not name_ or name_ == name] or DEFAULT_DELIMITERS,
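
The validation step above can be sketched outside of Augur as follows. The helper name `make_source_column_namer` is hypothetical, and `ValueError` stands in for `AugurError` so the example runs on its own. It illustrates the idea and is not the code from the commit.

```python
def make_source_column_namer(template):
    """Hypothetical helper: validate a template and return a naming function.

    Returns None when the template is None (i.e. source columns are disabled).
    """
    if template is None:
        return None
    if "{NAME}" not in template:
        # ValueError is a stand-in here; the commit raises AugurError.
        raise ValueError(
            f"template {template!r} lacks the literal placeholder {{NAME}}")
    # Plain str.replace (not str.format) keeps any other braces literal.
    return lambda name: template.replace("{NAME}", name)


namer = make_source_column_namer("origin_{NAME}")
print(namer("X"))
```

Using plain string replacement rather than `str.format` means a template containing other brace characters is passed through unchanged.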
@@ -195,6 +215,30 @@ def run(args):
Renaming may be done with `augur curate rename`.
"""))

    output_source_columns = set(
        output_source_column(m.name)
        for m in metadata
        if output_source_column)

    if conflicting_columns := [f"{c!r} in metadata table {m.name!r}"
                               for m in metadata
                               for c in m.columns
                               if c in output_source_columns]:
        raise AugurError(dedent(f"""\
            Generated source column names may not conflict with any column
            names in metadata inputs.
            The given source column template ({args.source_columns!r}) with the
            given metadata table names would conflict with the following input
            {_n("column", "columns", len(conflicting_columns))}:
            {indented_list(conflicting_columns, ' ' + ' ')}
            Please adjust the source column template with --source-columns
            and/or adjust the metadata table names to avoid conflicts.
            """))


try:
# Read all metadata files into a SQLite db
for m in metadata:
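
The conflict check above compares every generated source column name against every input column of every table. A small self-contained sketch of the same idea, assuming a plain dict of table name to column names (a made-up shape and made-up columns, not Augur's `NamedMetadata` objects):

```python
def find_conflicts(template, tables):
    """Return (column, table_name) pairs whose column equals a generated name.

    `tables` maps table name -> list of column names (hypothetical shape).
    """
    generated = {template.replace("{NAME}", name) for name in tables}
    return [(column, name)
            for name, columns in tables.items()
            for column in columns
            if column in generated]


# Illustrative inputs: with the bare template '{NAME}', the generated names
# are the table names themselves, 'a' and 'b'.
tables = {"a": ["strain", "a", "b"], "b": ["strain", "b"]}
conflicts = find_conflicts("{NAME}", tables)
safe = find_conflicts("origin_{NAME}", tables)
print(conflicts, safe)
```

Note that a conflict is reported for a column in any table, not only the table that produced the generated name.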
@@ -245,9 +289,11 @@ def run(args):
*(f"""coalesce({', '.join(f"nullif({x}, '')" for x in starmap(sqlite_quote_id, reversed(input_columns)))}, null) as {sqlite_quote_id(output_column)}"""
for output_column, input_columns in output_columns.items()),

# Source columns
*(f"""{sqlite_quote_id(m.table_name, m.id_column)} is not null as {sqlite_quote_id(f'__source_metadata_{m.name}')}"""
for m in metadata)]
# Source columns. Select expressions generated here instead of
# earlier to stay adjacent to the join conditions below, upon which
# these rely.
*(f"""{sqlite_quote_id(m.table_name, m.id_column)} is not null as {sqlite_quote_id(output_source_column(m.name))}"""
for m in metadata if output_source_column)]

from_list = [
sqlite_quote_id(metadata[0].table_name),
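
The source-column select expressions rely on the joins: a row missing from a table leaves that table's id column null, so `id is not null` evaluates to 1 or 0. The sketch below shows this with Python's `sqlite3` module. The table contents are made up, and the join shape (left joins from a union of ids) is an assumption chosen for portability, not necessarily the query Augur builds.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table x (strain text primary key, a text);
    create table y (strain text primary key, f text);
    insert into x values ('one', 'X1a'), ('two', 'X2a');
    insert into y values ('two', 'Y2f'), ('three', 'Y3f');
""")

# Each "is not null" test becomes a 1/0 source column named from the template.
rows = con.execute("""
    select ids.strain,
           x.strain is not null as "origin_X",
           y.strain is not null as "origin_Y"
      from (select strain from x union select strain from y) as ids
      left join x on x.strain = ids.strain
      left join y on y.strain = ids.strain
     order by ids.strain
""").fetchall()

for row in rows:
    print(row)
con.close()
```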
68 changes: 68 additions & 0 deletions tests/functional/merge/cram/merge.t
@@ -184,6 +184,28 @@ Metadata field values with metachars (field or record delimiters) are handled pr
x" 1 1
two X2a X2b X2c 1 1

Source columns template.

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --source-columns 'origin_{NAME}' \
> --output-metadata - --quiet | csv2tsv --csv-delim $'\t' | tsv-pretty
strain  a    b    c    f    e    d    origin_X  origin_Y
one     X1a  X1b  X1c                 1         0
two     X2a  X2b  Y2c  Y2f  Y2e  Y2d  1         1
three                  Y3f  Y3e  Y3d  0         1

No source columns.

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --no-source-columns \
> --output-metadata - --quiet | csv2tsv --csv-delim $'\t' | tsv-pretty
strain  a    b    c    f    e    d
one     X1a  X1b  X1c
two     X2a  X2b  Y2c  Y2f  Y2e  Y2d
three                  Y3f  Y3e  Y3d


ERROR HANDLING

@@ -325,6 +347,52 @@ Non-id column names conflicting with output id column name.

[2]

Invalid source columns template.

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --source-columns 'nope' \
> --output-metadata /dev/null --quiet
ERROR: The --source-columns template must contain the literal
placeholder {NAME} but the given value ('nope') does not.

You may need to quote the whole template value to prevent your
shell from interpreting the placeholder before Augur sees it.

[2]

$ ${AUGUR} merge \
> --metadata X=x.tsv Y=y.tsv \
> --source-columns '' \
> --output-metadata /dev/null --quiet
ERROR: The --source-columns template must contain the literal
placeholder {NAME} but the given value ('') does not.

You may need to quote the whole template value to prevent your
shell from interpreting the placeholder before Augur sees it.

[2]

$ ${AUGUR} merge \
> --metadata a=x.tsv b=y.tsv \
> --source-columns '{NAME}' \
> --output-metadata /dev/null --quiet
ERROR: Generated source column names may not conflict with any column
names in metadata inputs.

The given source column template ('{NAME}') with the
given metadata table names would conflict with the following input
columns:

'a' in metadata table 'a'
'b' in metadata table 'a'
'b' in metadata table 'b'

Please adjust the source column template with --source-columns
and/or adjust the metadata table names to avoid conflicts.

[2]

SQLITE3 env var can be used to override `sqlite3` location (and failure is
handled).
