Use less dtype inference when reading metadata into DataFrames #1252

victorlin · 2023-07-06T18:17:31Z

Description of proposed changes

Support passing the dtype parameter to pandas.read_csv in augur.io.read_metadata, and use it to not infer data types in subcommands that don't need it.

Internal references to read_metadata that were not updated due to broad use of data types:

export v1, v2: These support arbitrary columns for coloring. Data types should be inferred on all columns to ensure they are converted to JSON properly.

Related issue(s)

Prompted by Set date column to be string #1235 (comment)

Testing

Added a test for numerical queries that rely on dtypes
Checks pass

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

codecov · 2023-07-06T19:04:39Z

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (e17bd07) 66.69% compared to head (2a90aab) 66.28%.
Report is 12 commits behind head on master.

❗ Current head 2a90aab differs from pull request most recent head 7e81765. Consider uploading reports for the commit 7e81765 to get more accurate results

Files	Patch %	Lines
augur/io/metadata.py	63.63%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1252      +/-   ##
==========================================
- Coverage   66.69%   66.28%   -0.42%     
==========================================
  Files          69       69              
  Lines        7321     7269      -52     
  Branches     1797     1784      -13     
==========================================
- Hits         4883     4818      -65     
- Misses       2170     2181      +11     
- Partials      268      270       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

augur/io/metadata.py

augur/frequencies.py

augur/filter/include_exclude_rules.py

CHANGES.md

augur/filter/include_exclude_rules.py

huddlej · 2024-01-05T17:49:26Z

traits: This support arbitrary columns (via --columns) to perform discrete reconstruction on. I assume that numerical values as a numerical type might be useful here, so data types should be inferred on all columns.

I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. If we ever add support for continuous trait analysis, we would need to provide a way to explicitly enable type inference for columns. In the context of this specific command, that interface could look like --discrete-columns and --continuous-columns, if we wanted to support some combination of both inference modes in the same command.

From a slightly later comment on a different issue with the same topic, @victorlin noted that:

The general consensus was that we should use the "string" dtype for everything to avoid having to worry about how pandas automatically handles dtypes under the hood.

@victorlin @joverlee521 What do you think about revisiting the approach in this specific PR vs. some other approach?

victorlin · 2024-01-09T23:40:03Z

@huddlej thanks for letting me know about augur traits, I'll add it to this PR.

I need to update this with @joverlee521's suggestion to replace infer_dtypes with the more extensible dtype, then this should be ready for re-review.

@huddlej

@huddlej noted:¹ I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. ¹ <#1252 (comment)>

@huddlej

@huddlej noted:¹ I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. ¹ <#1252 (comment)>

victorlin · 2024-01-13T00:08:48Z

@huddlej added augur traits as 45983b2 and changelog entry:

traits: Previously, columns with only numeric values were treated as numerical data. These are now treated as categorical data for discrete trait analysis.

victorlin · 2024-01-13T00:13:24Z

augur/frequencies.py

+        # TODO: load only the ID and date columns when read_metadata supports
+        # loading a subset of all columns.


This is implemented in #1294 as ac23e80

huddlej

Thanks, @victorlin! With the latest changes in this PR, augur traits does exactly what I would expect. I'm pleasantly surprised that switching the dtype to "string" throughout augur (except export) hasn't broken any subcommands that silently depended on pandas type inference.

For posterity, could you update the PR description to no longer mention augur traits as a module you omitted?

augur/io/metadata.py

@huddlej

@huddlej noted:¹ I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. ¹ <#1252 (comment)>

@huddlej

@huddlej noted:¹ I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. ¹ <#1252 (comment)>

victorlin · 2024-01-19T20:03:32Z

I'm pleasantly surprised that switching the dtype to "string" throughout augur (except export) hasn't broken any subcommands that silently depended on pandas type inference.

I checked the subcommands to make sure that they don't depend on pandas type inference: e05038d...a65aa00

Looking back again, I'm not sure about frequencies. @huddlej since you authored 6548579, can you confirm that it's fine for all metadata to be str here? EDIT: asked on Slack

augur/augur/frequencies.py

Lines 117 to 120 in df9e8de

    
           # Annotate tips with metadata to enable filtering and weighting of 
        
           # frequencies by metadata attributes. 
        
           for key, value in metadata.loc[tip.name].items(): 
        
               tip.attr[key] = value

For posterity, could you update the PR description to no longer mention augur traits as a module you omitted?

Done!

huddlej · 2024-01-24T19:10:10Z

@victorlin Thanks for doublechecking about the use of metadata in the frequencies code. I can confirm that frequencies should not be affected by setting the default type to string. In addition to the date information that we parse from metadata, we also use metadata to weigh frequencies of individual records based on their values in a specific metadata column (e.g., region or country). We expect the values in the metadata and the weights provided by the user to be strings.

That said, because we could need a specific metadata field for this weighting process, this comment about only loading a subset of fields should probably mention that we'd need to optionally include the metadata column associated with the --weights-attribute value, when the user provides it.

Previously, dtype inference was done on all columns except the 2 that had predefined dtypes here. This option can be used to avoid dtype inference in cases where dtypes are known or need not be inferred. The default value of None maintains default behavior.

The existing test was good for comparisons between columns and numerical constants, but did not cover comparisons between two numerical columns.

Data types are unused except for the --query interface to support queries such as numerical comparisons. That filter function already attempts its own type inference to support numerical queries.

Only the strain and date columns are used. Skipping dtype inference on other columns should result in faster reading of metadata. A more optimal change would be to skip loading those unused columns entirely, but that requires another change in read_metadata.

Metadata is used for strain and date columns, which are already read as string. Other columns are used for KDE estimation. Skipping dtype inference on other columns should result in faster reading of metadata. A more optimal change would be to skip loading those unused columns entirely, but that requires another change in read_metadata.

@huddlej

@huddlej noted:¹ I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. ¹ <#1252 (comment)>

victorlin · 2024-01-24T23:41:48Z

@huddlej thanks for the explanation!

I've updated the comment to:

# TODO: load only the ID, date, and --weights-attribute columns when
# read_metadata supports loading a subset of all columns.

victorlin self-assigned this Jul 6, 2023

victorlin force-pushed the victorlin/read-metadata-dtypes branch from 04dd509 to 6908e81 Compare July 6, 2023 19:09

victorlin marked this pull request as ready for review July 6, 2023 20:49

victorlin requested a review from a team July 6, 2023 20:49

This was referenced Jul 6, 2023

Set date column to be string #1235

Merged

io/read_metadata: ignore pandas DtypeWarning #1238

Merged

joverlee521 reviewed Jul 8, 2023

View reviewed changes

augur/io/metadata.py Show resolved Hide resolved

augur/io/metadata.py Outdated Show resolved Hide resolved

augur/frequencies.py Outdated Show resolved Hide resolved

augur/filter/include_exclude_rules.py Outdated Show resolved Hide resolved

victorlin force-pushed the victorlin/read-metadata-dtypes branch 2 times, most recently from 50e8fbc to d5f80ef Compare July 10, 2023 19:29

This was referenced Jul 28, 2023

filter: Try converting all queried columns to numerical type #1268

Merged

filter: --query fails when numerical comparisons are used on columns with missing values #1269

Closed

victorlin marked this pull request as draft August 10, 2023 20:26

victorlin force-pushed the victorlin/read-metadata-dtypes branch from d5f80ef to bad49a1 Compare August 11, 2023 03:04

victorlin commented Aug 11, 2023

View reviewed changes

CHANGES.md Outdated Show resolved Hide resolved

augur/filter/include_exclude_rules.py Outdated Show resolved Hide resolved

victorlin marked this pull request as ready for review August 11, 2023 03:14

victorlin requested a review from joverlee521 August 11, 2023 03:14

victorlin marked this pull request as draft January 9, 2024 23:40

victorlin force-pushed the victorlin/read-metadata-dtypes branch from bad49a1 to 343b2a8 Compare January 12, 2024 23:39

victorlin force-pushed the victorlin/read-metadata-dtypes branch from 343b2a8 to 2a4c3a8 Compare January 12, 2024 23:54

victorlin force-pushed the victorlin/read-metadata-dtypes branch from 2a4c3a8 to 98e82ba Compare January 12, 2024 23:58

victorlin marked this pull request as ready for review January 13, 2024 00:08

victorlin requested a review from huddlej January 13, 2024 00:09

victorlin commented Jan 13, 2024

View reviewed changes

huddlej approved these changes Jan 19, 2024

View reviewed changes

augur/io/metadata.py Outdated Show resolved Hide resolved

victorlin force-pushed the victorlin/read-metadata-dtypes branch from 98e82ba to 4795a0b Compare January 19, 2024 19:40

victorlin force-pushed the victorlin/read-metadata-dtypes branch from 4795a0b to 2a90aab Compare January 19, 2024 20:02

victorlin mentioned this pull request Jan 19, 2024

Read a subset of metadata columns #1294

Merged

5 tasks

victorlin added 7 commits January 24, 2024 15:37

filter: Add another test for numerical queries

696517d

The existing test was good for comparisons between columns and numerical constants, but did not cover comparisons between two numerical columns.

filter: Read all metadata as string type

9f9be3a

Data types are unused except for the --query interface to support queries such as numerical comparisons. That filter function already attempts its own type inference to support numerical queries.

Update changelog

7e81765

victorlin force-pushed the victorlin/read-metadata-dtypes branch from 2a90aab to 7e81765 Compare January 24, 2024 23:40

victorlin merged commit cae562a into master Jan 24, 2024
11 of 22 checks passed

victorlin deleted the victorlin/read-metadata-dtypes branch January 24, 2024 23:43

victorlin mentioned this pull request Jan 25, 2024

frequencies: Annotate tips with the minimum necessary information #1398

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use less dtype inference when reading metadata into DataFrames #1252

Use less dtype inference when reading metadata into DataFrames #1252

victorlin commented Jul 6, 2023 •

edited

Loading

codecov bot commented Jul 6, 2023 •

edited

Loading

huddlej commented Jan 5, 2024

victorlin commented Jan 9, 2024

victorlin commented Jan 13, 2024

victorlin Jan 13, 2024

huddlej left a comment

victorlin commented Jan 19, 2024 •

edited

Loading

huddlej commented Jan 24, 2024

victorlin commented Jan 24, 2024

		# TODO: load only the ID and date columns when read_metadata supports
		# loading a subset of all columns.

Use less dtype inference when reading metadata into DataFrames #1252

Use less dtype inference when reading metadata into DataFrames #1252

Conversation

victorlin commented Jul 6, 2023 • edited Loading

Description of proposed changes

Related issue(s)

Testing

Checklist

codecov bot commented Jul 6, 2023 • edited Loading

Codecov Report

huddlej commented Jan 5, 2024

victorlin commented Jan 9, 2024

victorlin commented Jan 13, 2024

victorlin Jan 13, 2024

Choose a reason for hiding this comment

huddlej left a comment

Choose a reason for hiding this comment

victorlin commented Jan 19, 2024 • edited Loading

huddlej commented Jan 24, 2024

victorlin commented Jan 24, 2024

victorlin commented Jul 6, 2023 •

edited

Loading

codecov bot commented Jul 6, 2023 •

edited

Loading

victorlin commented Jan 19, 2024 •

edited

Loading