Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use less dtype inference when reading metadata into DataFrames #1252

Merged
merged 7 commits into from
Jan 24, 2024

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Jul 6, 2023

Description of proposed changes

Support passing the dtype parameter to pandas.read_csv in augur.io.read_metadata, and use it to not infer data types in subcommands that don't need it.

Internal references to read_metadata that were not updated due to broad use of data types:

  • export v1, v2: These support arbitrary columns for coloring. Data types should be inferred on all columns to ensure they are converted to JSON properly.

Related issue(s)

Testing

  • Added a test for numerical queries that rely on dtypes
  • Checks pass

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@victorlin victorlin self-assigned this Jul 6, 2023
@codecov
Copy link

codecov bot commented Jul 6, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (e17bd07) 66.69% compared to head (2a90aab) 66.28%.
Report is 12 commits behind head on master.

❗ Current head 2a90aab differs from pull request most recent head 7e81765. Consider uploading reports for the commit 7e81765 to get more accurate results

Files Patch % Lines
augur/io/metadata.py 63.63% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1252      +/-   ##
==========================================
- Coverage   66.69%   66.28%   -0.42%     
==========================================
  Files          69       69              
  Lines        7321     7269      -52     
  Branches     1797     1784      -13     
==========================================
- Hits         4883     4818      -65     
- Misses       2170     2181      +11     
- Partials      268      270       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from 04dd509 to 6908e81 Compare July 6, 2023 19:09
@victorlin victorlin marked this pull request as ready for review July 6, 2023 20:49
@victorlin victorlin requested a review from a team July 6, 2023 20:49
augur/io/metadata.py Show resolved Hide resolved
augur/io/metadata.py Outdated Show resolved Hide resolved
augur/frequencies.py Outdated Show resolved Hide resolved
augur/filter/include_exclude_rules.py Outdated Show resolved Hide resolved
CHANGES.md Outdated Show resolved Hide resolved
augur/filter/include_exclude_rules.py Outdated Show resolved Hide resolved
@victorlin victorlin marked this pull request as ready for review August 11, 2023 03:14
@huddlej
Copy link
Contributor

huddlej commented Jan 5, 2024

traits: This support arbitrary columns (via --columns) to perform discrete reconstruction on. I assume that numerical values as a numerical type might be useful here, so data types should be inferred on all columns.

I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. If we ever add support for continuous trait analysis, we would need to provide a way to explicitly enable type inference for columns. In the context of this specific command, that interface could look like --discrete-columns and --continuous-columns, if we wanted to support some combination of both inference modes in the same command.

From a slightly later comment on a different issue with the same topic, @victorlin noted that:

The general consensus was that we should use the "string" dtype for everything to avoid having to worry about how pandas automatically handles dtypes under the hood.

@victorlin @joverlee521 What do you think about revisiting the approach in this specific PR vs. some other approach?

@victorlin
Copy link
Member Author

@huddlej thanks for letting me know about augur traits, I'll add it to this PR.

I need to update this with @joverlee521's suggestion to replace infer_dtypes with the more extensible dtype, then this should be ready for re-review.

@victorlin victorlin marked this pull request as draft January 9, 2024 23:40
@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from bad49a1 to 343b2a8 Compare January 12, 2024 23:39
victorlin added a commit that referenced this pull request Jan 12, 2024
@huddlej noted:¹

    I just ran into this issue where I need to run augur traits on values
    that are numeric (integers) but should be treated as strings (the
    integers are actually cluster labels, so they identify groups of samples
    and shouldn't be treated like numbers). Since augur traits only supports
    discrete trait analysis, I would expect no type inference from the
    metadata prior to running DTA.

¹ <#1252 (comment)>
@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from 343b2a8 to 2a4c3a8 Compare January 12, 2024 23:54
victorlin added a commit that referenced this pull request Jan 12, 2024
@huddlej noted:¹

    I just ran into this issue where I need to run augur traits on values
    that are numeric (integers) but should be treated as strings (the
    integers are actually cluster labels, so they identify groups of samples
    and shouldn't be treated like numbers). Since augur traits only supports
    discrete trait analysis, I would expect no type inference from the
    metadata prior to running DTA.

¹ <#1252 (comment)>
@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from 2a4c3a8 to 98e82ba Compare January 12, 2024 23:58
@victorlin
Copy link
Member Author

@huddlej added augur traits as 45983b2 and changelog entry:

traits: Previously, columns with only numeric values were treated as numerical data. These are now treated as categorical data for discrete trait analysis.

@victorlin victorlin marked this pull request as ready for review January 13, 2024 00:08
Comment on lines 88 to 89
# TODO: load only the ID and date columns when read_metadata supports
# loading a subset of all columns.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implemented in #1294 as ac23e80

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @victorlin! With the latest changes in this PR, augur traits does exactly what I would expect. I'm pleasantly surprised that switching the dtype to "string" throughout augur (except export) hasn't broken any subcommands that silently depended on pandas type inference.

For posterity, could you update the PR description to no longer mention augur traits as a module you omitted?

augur/io/metadata.py Outdated Show resolved Hide resolved
victorlin added a commit that referenced this pull request Jan 19, 2024
@huddlej noted:¹

    I just ran into this issue where I need to run augur traits on values
    that are numeric (integers) but should be treated as strings (the
    integers are actually cluster labels, so they identify groups of samples
    and shouldn't be treated like numbers). Since augur traits only supports
    discrete trait analysis, I would expect no type inference from the
    metadata prior to running DTA.

¹ <#1252 (comment)>
@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from 98e82ba to 4795a0b Compare January 19, 2024 19:40
victorlin added a commit that referenced this pull request Jan 19, 2024
@huddlej noted:¹

    I just ran into this issue where I need to run augur traits on values
    that are numeric (integers) but should be treated as strings (the
    integers are actually cluster labels, so they identify groups of samples
    and shouldn't be treated like numbers). Since augur traits only supports
    discrete trait analysis, I would expect no type inference from the
    metadata prior to running DTA.

¹ <#1252 (comment)>
@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from 4795a0b to 2a90aab Compare January 19, 2024 20:02
@victorlin
Copy link
Member Author

victorlin commented Jan 19, 2024

I'm pleasantly surprised that switching the dtype to "string" throughout augur (except export) hasn't broken any subcommands that silently depended on pandas type inference.

I checked the subcommands to make sure that they don't depend on pandas type inference: e05038d...a65aa00

Looking back again, I'm not sure about frequencies. @huddlej since you authored 6548579, can you confirm that it's fine for all metadata to be str here? EDIT: asked on Slack

augur/augur/frequencies.py

Lines 117 to 120 in df9e8de

# Annotate tips with metadata to enable filtering and weighting of
# frequencies by metadata attributes.
for key, value in metadata.loc[tip.name].items():
tip.attr[key] = value

For posterity, could you update the PR description to no longer mention augur traits as a module you omitted?

Done!

@victorlin victorlin mentioned this pull request Jan 19, 2024
5 tasks
@huddlej
Copy link
Contributor

huddlej commented Jan 24, 2024

@victorlin Thanks for doublechecking about the use of metadata in the frequencies code. I can confirm that frequencies should not be affected by setting the default type to string. In addition to the date information that we parse from metadata, we also use metadata to weigh frequencies of individual records based on their values in a specific metadata column (e.g., region or country). We expect the values in the metadata and the weights provided by the user to be strings.

That said, because we could need a specific metadata field for this weighting process, this comment about only loading a subset of fields should probably mention that we'd need to optionally include the metadata column associated with the --weights-attribute value, when the user provides it.

Previously, dtype inference was done on all columns except the 2 that
had predefined dtypes here. This option can be used to avoid dtype
inference in cases where dtypes are known or need not be inferred.

The default value of None maintains default behavior.
The existing test was good for comparisons between columns and numerical
constants, but did not cover comparisons between two numerical columns.
Data types are unused except for the --query interface to support
queries such as numerical comparisons. That filter function already
attempts its own type inference to support numerical queries.
Only the strain and date columns are used. Skipping dtype inference on
other columns should result in faster reading of metadata.

A more optimal change would be to skip loading those unused columns
entirely, but that requires another change in read_metadata.
Metadata is used for strain and date columns, which are already read as
string. Other columns are used for KDE estimation. Skipping dtype
inference on other columns should result in faster reading of metadata.

A more optimal change would be to skip loading those unused columns
entirely, but that requires another change in read_metadata.
@huddlej noted:¹

    I just ran into this issue where I need to run augur traits on values
    that are numeric (integers) but should be treated as strings (the
    integers are actually cluster labels, so they identify groups of samples
    and shouldn't be treated like numbers). Since augur traits only supports
    discrete trait analysis, I would expect no type inference from the
    metadata prior to running DTA.

¹ <#1252 (comment)>
@victorlin victorlin force-pushed the victorlin/read-metadata-dtypes branch from 2a90aab to 7e81765 Compare January 24, 2024 23:40
@victorlin
Copy link
Member Author

@huddlej thanks for the explanation!

I've updated the comment to:

# TODO: load only the ID, date, and --weights-attribute columns when
# read_metadata supports loading a subset of all columns.

@victorlin victorlin merged commit cae562a into master Jan 24, 2024
11 of 22 checks passed
@victorlin victorlin deleted the victorlin/read-metadata-dtypes branch January 24, 2024 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants