Use less dtype inference when reading metadata into DataFrames #1252
Codecov Report
Coverage diff: master vs. #1252
            master   #1252    +/-
Coverage    66.69%   66.28%   -0.42%
Files       69       69
Lines       7321     7269     -52
Branches    1797     1784     -13
Hits        4883     4818     -65
Misses      2170     2181     +11
Partials    268      270      +2
04dd509 to 6908e81
50e8fbc to d5f80ef
d5f80ef to bad49a1
I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA. If we ever add support for continuous trait analysis, we would need to provide a way to explicitly enable type inference for columns. In the context of this specific command, that interface could look like
From a slightly later comment on a different issue with the same topic, @victorlin noted that:
@victorlin @joverlee521 What do you think about revisiting the approach in this specific PR vs. some other approach?
@huddlej thanks for letting me know about
I need to update this with @joverlee521's suggestion to replace
bad49a1 to 343b2a8
@huddlej noted:¹
I just ran into this issue where I need to run augur traits on values that are numeric (integers) but should be treated as strings (the integers are actually cluster labels, so they identify groups of samples and shouldn't be treated like numbers). Since augur traits only supports discrete trait analysis, I would expect no type inference from the metadata prior to running DTA.
¹ <#1252 (comment)>
343b2a8 to 2a4c3a8
2a4c3a8 to 98e82ba
augur/frequencies.py (Outdated)
# TODO: load only the ID and date columns when read_metadata supports
# loading a subset of all columns.
Thanks, @victorlin! With the latest changes in this PR, augur traits does exactly what I would expect. I'm pleasantly surprised that switching the dtype to "string" throughout augur (except export) hasn't broken any subcommands that silently depended on pandas type inference.
For posterity, could you update the PR description to no longer mention augur traits as a module you omitted?
98e82ba to 4795a0b
4795a0b to 2a90aab
I checked the subcommands to make sure that they don't depend on pandas type inference: e05038d...a65aa00
Looking back again, I'm not sure about frequencies. @huddlej since you authored 6548579, can you confirm that it's fine for all metadata to be
Lines 117 to 120 in df9e8de
Done!
@victorlin Thanks for double-checking the use of metadata in the frequencies code. I can confirm that frequencies should not be affected by setting the default type to
That said, because we could need a specific metadata field for this weighting process, this comment about only loading a subset of fields should probably mention that we'd need to optionally include the metadata column associated with the
Previously, dtype inference was done on all columns except the 2 that had predefined dtypes here. This option can be used to avoid dtype inference in cases where dtypes are known or need not be inferred. The default value of None maintains default behavior.
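A minimal sketch of the behavior this commit message describes, written against pandas.read_csv directly. The read_table helper and the sample TSV are hypothetical stand-ins for illustration, not augur's read_metadata; the only assumption carried over from the text is that a dtype argument is forwarded to pandas.read_csv and that None keeps pandas' default inference.

```python
import io

import pandas as pd

# Hypothetical sample metadata; "cluster" holds integer-looking labels.
TSV = "strain\tdate\tcluster\nA\t2020-01-01\t1\nB\t2020-02-01\t2\n"


def read_table(text, dtype=None):
    """Hypothetical reader: forwards dtype to pandas.read_csv.

    dtype=None keeps pandas' default type inference.
    """
    return pd.read_csv(io.StringIO(text), sep="\t", dtype=dtype)


# Default behavior: pandas infers that "cluster" is an integer column.
inferred = read_table(TSV)

# No inference: every column is read as a string, so labels stay labels.
as_strings = read_table(TSV, dtype="string")

print(inferred["cluster"].dtype)
print(as_strings["cluster"].tolist())
```

With dtype="string", the cluster labels from the augur traits use case above remain strings rather than being silently converted to integers.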
The existing test was good for comparisons between columns and numerical constants, but did not cover comparisons between two numerical columns.
Data types are unused except for the --query interface to support queries such as numerical comparisons. That filter function already attempts its own type inference to support numerical queries.
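To illustrate why a query between two numerical columns needs type inference when metadata is read as strings, here is a rough sketch. The infer_numeric_columns helper, the column names, and the values are assumptions made for this example; it is not the code of augur's filter function, only one way such inference could work using pd.to_numeric.

```python
import pandas as pd


def infer_numeric_columns(df):
    """Return a copy in which columns that parse fully as numbers are numeric.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Column has non-numeric values; keep it as strings.
            pass
    return out


# All values are strings, as they would be without dtype inference.
metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "C"],
        "coverage": ["9", "10", "100"],
        "min_cov": ["10", "10", "10"],
    }
)

# Comparing the string columns directly is lexicographic: "9" >= "10" is True.
string_match = metadata.loc[metadata["coverage"] >= metadata["min_cov"], "strain"].tolist()

# After inference, the same comparison between the two columns is numeric.
numeric = infer_numeric_columns(metadata)
numeric_match = numeric.query("coverage >= min_cov")["strain"].tolist()

print(string_match)
print(numeric_match)
```

The string comparison keeps strain A because "9" sorts after "10", which is the kind of column-to-column case the added test is meant to cover.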
Only the strain and date columns are used. Skipping dtype inference on other columns should result in faster reading of metadata. A more optimal change would be to skip loading those unused columns entirely, but that requires another change in read_metadata.
Metadata is used for strain and date columns, which are already read as string. Other columns are used for KDE estimation. Skipping dtype inference on other columns should result in faster reading of metadata. A more optimal change would be to skip loading those unused columns entirely, but that requires another change in read_metadata.
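The "more optimal change" mentioned in these two commit messages, skipping unused columns entirely, can be sketched with the usecols parameter of pandas.read_csv. The read_columns helper and the sample data are hypothetical; read_metadata did not support this at the time of the PR, per the messages above.

```python
import io

import pandas as pd

# Hypothetical sample metadata with columns a subcommand would not use.
TSV = (
    "strain\tdate\tregion\tcluster\n"
    "A\t2020-01-01\tEurope\t1\n"
    "B\t2020-02-01\tAsia\t2\n"
)


def read_columns(text, columns):
    """Hypothetical reader: load only the named columns, all as strings."""
    return pd.read_csv(
        io.StringIO(text),
        sep="\t",
        usecols=columns,  # unused columns are never materialized
        dtype="string",   # and no dtype inference is done on the rest
    )


subset = read_columns(TSV, ["strain", "date"])
print(subset.columns.tolist())
```

A caller that needs one extra field, such as a column used for weighting, could pass it in the columns list alongside the ID and date columns.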
2a90aab to 7e81765
@huddlej thanks for the explanation! I've updated the comment to:
Description of proposed changes
Support passing the dtype parameter to pandas.read_csv in augur.io.read_metadata, and use it to not infer data types in subcommands that don't need it.
Internal references to read_metadata that were not updated due to broad use of data types:
Related issue(s)
date column to be string
#1235 (comment)
Testing
Checklist