remove filtering of common strains during loading #200

CunliangGeng · 2023-12-12T15:33:58Z

The loading part should return all relevant strains to the scoring part, and users should then filter strains in the scoring part. This PR therefore removes the method `_filter_only_common_strains` from the loading process.

Upcoming PRs will add the filtering of common strains to the scoring part.

CunliangGeng · 2023-12-12T15:34:26Z

Current dependencies on/for this PR:

dev
- PR refactor filtering of user specified strains #178
  - PR Enable GCF loader to filter singleton GCFs #181
    - PR return list for get_bgcs methods of BGC loaders #182
      - PR rename strains.py to strain.py #183
        
        PR remove deprecated functions of loading genomics data #184
        
        PR update the process of loading genomics data #185
        
        PR rename genomics.py to utils.py #186
        
        PR remove deprecated functions of loading metabolomics data #187
        
        PR refactor class MolecularFamily #188
        PR refactor class GCF #189
        PR enable parallel testing using pytest-xdist #190
        PR Enable filtering of singleton molecular families #191
        PR Remove class SingletonFamily and MF attribute id #192
        PR refactor Spectrum class #193
        PR remove unused mgf.py and its test file #194
        PR Update logics of loading metabolomics data #196
        PR Update loading of genomics data #197
        PR deprecate the method _load_class_info #198
        PR remove loading of optional data #199
        PR remove filtering of common strains during loading #200 👈
        PR update logics of loading mibig data #201
        PR refactor initiation of paths to metabolomics files #202
        
        PR Update logics of loading metabolomics data #195

This stack of pull requests is managed by Graphite.

gcroci2

Maybe I'm missing something here, but if a strain is not relevant (i.e., it's not present in both genomic and metabolomic data), what is the use case in which it could be used for the scoring part anyway? In other words, how can a link be present (and scored) at all in such cases?

justinjjvanderhooft · 2023-12-21T15:59:45Z

Good point @gcroci2, I think it depends on how we organize the loading of the strains - and the matching. This is one of the most critical steps as so far, most of the issues arise from this. @CunliangGeng, what is your current idea? First load everything and then ensure there are matches in a second step, before moving on to the scoring?

CunliangGeng · 2023-12-21T17:21:27Z

@gcroci2 @justinjjvanderhooft If a strain is not a common strain, it may appear only in the genomics data, only in the metabolomics data, or in neither. The current workflow loads everything and passes it to the scoring part; users then set which strains to use in the scoring part, e.g. there would be a scoring parameter like `use_only_common_strains` and/or `filter_specified_strains`...
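
For illustration only, the two proposed scoring parameters could behave roughly as sketched here. The function name `select_strains_for_scoring` and its signature are hypothetical and not part of NPLinker; only the parameter names `use_only_common_strains` and `filter_specified_strains` come from the comment above, and their exact semantics are an assumption.

```python
from typing import Iterable, Optional, Set


def select_strains_for_scoring(
    genomic_strains: Iterable[str],
    metabolomic_strains: Iterable[str],
    use_only_common_strains: bool = True,
    filter_specified_strains: Optional[Iterable[str]] = None,
) -> Set[str]:
    """Hypothetical helper: pick the strains that scoring would consider."""
    genomic = set(genomic_strains)
    metabolomic = set(metabolomic_strains)

    # Common strains are those present in both data types.
    if use_only_common_strains:
        selected = genomic & metabolomic
    else:
        selected = genomic | metabolomic

    # Optionally restrict further to strains the user listed explicitly.
    if filter_specified_strains is not None:
        selected &= set(filter_specified_strains)
    return selected
```

With this shape, loading stays unfiltered and every strain-selection decision is visible in one place at scoring time.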

justinjjvanderhooft · 2023-12-22T10:58:46Z

@CunliangGeng I think I can follow your logic: we load all that is there, and then prior to scoring we let the user choose/assess what will be considered. Is that how you envisioned it?

I think what @gcroci2 and I are referring to is that, for any "linking score" to work, we need at least 1 (and preferably more!) "common" strains between the genome and metabolome data that are uploaded. For the PoDP data, this is not an issue, as they are matched through the PoDP metadata. For the local data (where many users will start with), what do we currently have in mind? Upon preparing the data, do a validation that at least one strain label is matching (common) between the genome and metabolome data? And/or give an overview of common and "not-matching" strains (labels) prior to calculating the score(s)?

It is good that you raise this point. I am happy to follow your advice/logic @CunliangGeng, as long as it is clear to the user at an early stage how many matching (common) strains are available for any scoring. Whilst it is still unclear to me now what we can do with the not matching strain labels, I can think of enough use cases (i.e., streamlined loading of genome data for networking, etc.) - is that what you had in mind @CunliangGeng?

CunliangGeng · 2024-01-24T14:01:58Z

@justinjjvanderhooft @gcroci2
Here are my answers to Justin's questions:

If users choose to use common strains for scoring and there are actually no common strains, then it should raise a warning/error.

We can provide methods in class NPLinker to check if common strains exist or not, or to check the number of common/non-common strains.
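
A minimal sketch of what such checks might look like, written as standalone functions rather than as `NPLinker` methods. The names `count_strain_types` and `check_common_strains`, and the `strict` flag that switches between a warning and an error, are hypothetical and only illustrate the idea.

```python
import warnings
from typing import Dict, Iterable, Set


def count_strain_types(
    genomic_strains: Iterable[str], metabolomic_strains: Iterable[str]
) -> Dict[str, int]:
    """Count common and non-common strains (hypothetical helper)."""
    genomic, metabolomic = set(genomic_strains), set(metabolomic_strains)
    return {
        "common": len(genomic & metabolomic),
        "genomics_only": len(genomic - metabolomic),
        "metabolomics_only": len(metabolomic - genomic),
    }


def check_common_strains(
    genomic_strains: Iterable[str],
    metabolomic_strains: Iterable[str],
    strict: bool = False,
) -> Set[str]:
    """Return the common strains; warn (or raise if strict) when there are none."""
    common = set(genomic_strains) & set(metabolomic_strains)
    if not common:
        message = "No common strains between genomics and metabolomics data."
        if strict:
            raise ValueError(message)
        warnings.warn(message)
    return common
```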

> For the local data (where many users will start with), what do we currently have in mind? Upon preparing the data, do a validation that at least one strain label is matching (common) between the genome and metabolome data?

For the local data mode, users have to manually prepare the strain mappings and ensure that each strain ID has aliases for both the genomics and the metabolomics data. After loading the strain mappings (and the other data), users can use the new method in the NPLinker class to check the number of strains of each type.
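
As a rough illustration of what "has aliases for both" means, the sketch here classifies each strain by whether any of its names appears among the genomics labels, the metabolomics labels, or both. The function `classify_strains` and the plain-dict representation of the strain mappings are assumptions made for this example and do not reflect NPLinker's actual data structures.

```python
from typing import Dict, Iterable


def classify_strains(
    strain_aliases: Dict[str, Iterable[str]],
    genomic_labels: Iterable[str],
    metabolomic_labels: Iterable[str],
) -> Dict[str, str]:
    """Label each strain ID as common, genomics_only, metabolomics_only or unmatched."""
    genomic = set(genomic_labels)
    metabolomic = set(metabolomic_labels)
    result = {}
    for strain_id, aliases in strain_aliases.items():
        # A strain is recognised by its ID or by any of its aliases.
        names = {strain_id, *aliases}
        in_genomics = bool(names & genomic)
        in_metabolomics = bool(names & metabolomic)
        if in_genomics and in_metabolomics:
            result[strain_id] = "common"
        elif in_genomics:
            result[strain_id] = "genomics_only"
        elif in_metabolomics:
            result[strain_id] = "metabolomics_only"
        else:
            result[strain_id] = "unmatched"
    return result
```

A strain whose aliases cover only one data type ends up as `genomics_only` or `metabolomics_only`, which is the situation a manually prepared mapping needs to avoid for strains meant to be scored.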

> Whilst it is still unclear to me now what we can do with the not matching strain labels, I can think of enough use cases (i.e., streamlined loading of genome data for networking, etc.) - is that what you had in mind

Data such as BGCs that do not have a matched strain won't be passed to the scoring stage. In the loading stage, the number of matched/unmatched data will be reported in the log.
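
One possible shape for that reporting, sketched with the standard `logging` module. The function `split_by_known_strain`, the BGC-to-strain dictionary, and the log message wording are hypothetical and chosen only for this example.

```python
import logging
from typing import Dict, List, Optional, Set, Tuple

logger = logging.getLogger(__name__)


def split_by_known_strain(
    bgc_to_strain: Dict[str, Optional[str]], known_strains: Set[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Separate BGCs with a matched strain from those without, and log the counts."""
    matched = {
        bgc_id: strain
        for bgc_id, strain in bgc_to_strain.items()
        if strain is not None and strain in known_strains
    }
    unmatched = [bgc_id for bgc_id in bgc_to_strain if bgc_id not in matched]
    logger.info(
        "Loaded %d BGCs: %d with a matched strain, %d without",
        len(bgc_to_strain),
        len(matched),
        len(unmatched),
    )
    # Only the matched BGCs would be passed on to the scoring stage.
    return matched, unmatched
```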

justinjjvanderhooft · 2024-01-24T14:50:47Z

Thanks for your answers @gcroci2! I think it may also be a "semantic" confusion, but I think I can follow the logic now. Let's get back to this in an online / in person meeting to clarify everything :-)

CunliangGeng · 2024-01-24T15:04:11Z

@gcroci2 Please explicitly approve the PR if it looks good. Without approval from at least one reviewer, the PR cannot be merged :-)

CunliangGeng · 2024-01-24T15:20:39Z

Merge activity

Jan 24, 10:20 AM: @CunliangGeng started a stack merge that includes this pull request via Graphite.
Jan 24, 10:28 AM: Graphite rebased this pull request as part of a merge.
Jan 24, 10:29 AM: @CunliangGeng merged this pull request with Graphite.

The loading part should return all relevant strains to the scoring part. Users should be able to specify in the scoring part whether or not to filter common strains (to be implemented in the class `NPLinker`).

CunliangGeng mentioned this pull request Dec 12, 2023

remove loading of optional data #199

Merged

CunliangGeng requested a review from gcroci2 December 12, 2023 15:49

CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from db67ce5 to 9b08108 Compare December 14, 2023 08:33

CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 666f018 to 827ae43 Compare December 14, 2023 08:34

CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 9b08108 to 256692a Compare December 14, 2023 10:00

CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 827ae43 to 0d42e94 Compare December 14, 2023 10:01

CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 256692a to f7bd16d Compare December 19, 2023 13:21

CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 0d42e94 to a4be9e2 Compare December 19, 2023 13:21

CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from f7bd16d to ee755e0 Compare December 19, 2023 13:38

CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from a4be9e2 to 7794618 Compare December 19, 2023 13:38

CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from ee755e0 to 8517df0 Compare December 20, 2023 11:53

CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 7794618 to a810870 Compare December 20, 2023 11:53

CunliangGeng mentioned this pull request Dec 20, 2023

update logics of loading mibig data #201

Merged

CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 8517df0 to 0f4bc7c Compare December 20, 2023 12:24

CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from a810870 to 88d39a5 Compare December 20, 2023 12:24

CunliangGeng mentioned this pull request Dec 21, 2023

refactor initiation of paths to metabolomics files #202

Merged

gcroci2 reviewed Dec 21, 2023

gcroci2 self-requested a review January 24, 2024 15:07

gcroci2 approved these changes Jan 24, 2024

CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 0f4bc7c to f0235cb Compare January 24, 2024 15:25

Base automatically changed from 12-12-remove_loading_optional_data to dev January 24, 2024 15:27

remove method _filter_only_common_strains

d1adc66

The loading part should return all relevant strains to scoring part. It should allow users to specify if they want to filter common strains or not in the scoring part (to be implemented in the class `NPLinker`).

CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 88d39a5 to d1adc66 Compare January 24, 2024 15:28

CunliangGeng merged commit 4de30e3 into dev Jan 24, 2024
1 of 2 checks passed

CunliangGeng deleted the 12-12-remove-filtering-common-strains branch January 24, 2024 15:29
