-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove filtering of common strains during loading #200
Conversation
db67ce5
to
9b08108
Compare
666f018
to
827ae43
Compare
9b08108
to
256692a
Compare
827ae43
to
0d42e94
Compare
256692a
to
f7bd16d
Compare
0d42e94
to
a4be9e2
Compare
f7bd16d
to
ee755e0
Compare
a4be9e2
to
7794618
Compare
ee755e0
to
8517df0
Compare
7794618
to
a810870
Compare
8517df0
to
0f4bc7c
Compare
a810870
to
88d39a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something here, but if a strain is not relevant (i.e., it's not present in both genomic and metabolomic data), what is the use case in which it could be used for the scoring part anyway? Which means, how can a link be present (and scored) at all in such cases?
Good point @gcroci2, I think it depends on how we organize the loading of the strains - and the matching. This is one of the most critical steps as so far, most of the issues arise from this. @CunliangGeng, what is your current idea? First load everything and then ensure there are matches in a second step, before moving on to the scoring? |
@gcroci2 @justinjjvanderhooft If a strain is not a common strain, it could be used by only genomics or only metabolomics or neither. The current workflow will load everything and pass them to scoring part, users then set what strains to use in the scoring part, e.g. there would be a parameter in scoring like |
@CunliangGeng I think I can follow your logic, as in, we load all that is there, and then prior to scoring we let the user choose/assess what will be considered....is that how you envisioned it? I think what @gcroci2 and myself are referring to is that, for any "linking score" to work, we need at least 1 (and preferably more!) "common" strains between the genome and metabolome data that is uploaded. For the PoDP data, this is not an issue, as they are matched through the PoDP metadata. For the local data (where many users will start with), what do we currently have in mind? Upon preparing the data, do a validation that at least one strain label is matching (common) between the genome and metabolome data? And/or give an overview of common and "not-matching" strains (labels) prior to calculating the score(s). It is good that you raise this point. I am happy to follow your advice/logic @CunliangGeng, as long as it is clear to the user in an early stage how many matching (common) strains there are available for any scoring. Whilst it is still unclear to me now what we can do with the not matching strain labels, I can think of enough use cases (i.e., streamlined loading of genome data for networking, etc.) - is that what you had in mind @CunliangGeng? |
@justinjjvanderhooft @gcroci2 If users set it to use common strains for scoring and actually there is no common strains, then it should raise an warning/error. We can provide methods in class
For the local data mode, users have to manually prepare the strain mappings and ensure the strain id has alias for both genomics and metabolomics data. After loading the strain mappings (and also other data), users can use the new method in
For the data like BGC that do not have a matched strain, they won't be passed to scoring stage. In the loading stage, the number of matched/unmatched data will be reported in the log. |
Thanks for your answers @gcroci2! I think it may also be a "semantic" confusion, but I think I can follow the logic now. Let's get back to this in an online / in person meeting to clarify everything :-) |
@gcroci2 Please explicitly approve the PR if it looks good. Without approval from at least one reviewer, the PR cannot be merged :-) |
Merge activity
|
0f4bc7c
to
f0235cb
Compare
The loading part should return all relevant strains to scoring part. It should allow users to specify if they want to filter common strains or not in the scoring part (to be implemented in the class `NPLinker`).
88d39a5
to
d1adc66
Compare
The loading part should return all relevant strains to scoring part. And then users should do the filtering of strains in the scoring part. So removing the method
_filter_only_common_strains
from loading process.The upcoming PRs will add the filtering of common strains in the scoring part.