-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: step to export disease/target evidence #867
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Daniel!
I have some questions, please have a look
src/gentropy/config.py
Outdated
class LocusToGeneEvidenceStepConfig(StepConfig): | ||
"""Configuration of the locus to gene evidence step.""" | ||
|
||
session: Any = field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to specify the session in the config? Hail is set to false by default
study_locus: StudyLocus, | ||
study_index: StudyIndex, | ||
l2g_threshold: float = 0.05, | ||
) -> DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Virtually this is another source of evidence generation, and it is not going to go through validation unlike the others.
Maybe it's annoying, but I think it'd be nice to include a new data model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this... and not convinced. Evidence doesn't have any downstream application, no methods are called on evidence dataset... Especially considering we only have five or so columns, of which two are constant. I'm quite inclined not to have an other dataset. Later maybe... eg. in case if the generation of the VEP input requires it.
|
||
return ( | ||
self.df.filter(f.col("score") >= l2g_threshold) | ||
.join(study_locus.df, on="studyLocusId", how="inner") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the join lighter
.join(study_locus.df, on="studyLocusId", how="inner") | |
.join(study_locus.df.select("studyLocusId", "studyId"), on="studyLocusId", how="inner") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect spark optimizes this in the background, but it's better to be more explicit.
return ( | ||
self.df.filter(f.col("score") >= l2g_threshold) | ||
.join(study_locus.df, on="studyLocusId", how="inner") | ||
.join(study_index.df.drop("geneId"), on="studyId", how="inner") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the join lighter
.join(study_index.df.drop("geneId"), on="studyId", how="inner") | |
.join(study_index.df.select("studyId", "diseaseIds"), on="studyId", how="inner") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same.
f.col("geneId").alias("targetFromSourceId"), | ||
f.explode(f.col("diseaseIds")).alias("diseaseFromSourceMappedId"), | ||
f.col("score").alias("resourceScore"), | ||
f.col("studyLocusId").alias("studyLocusId"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant?
f.col("studyLocusId").alias("studyLocusId"), | |
f.col("studyLocusId"), |
src/gentropy/l2g.py
Outdated
credible_set_path: str, | ||
study_index_path: str, | ||
evidence_output_path: str, | ||
locus_to_gene_threshold: float = LocusToGeneEvidenceStepConfig().locus_to_gene_threshold, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think i know now why you are specifying the session to the config.I had this scenario before. And if you do LocusToGeneEvidenceStepConfig()
, you are going to trigger Spark to get or create (in this case get) the session, which is prob not desirable. Because of that, I suggest not to get the default from the config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do you agree that the action here is not to pull the parameter from the config and remove the session reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in case of the float you could not evaluate the LocusToGeneEvidenceStepConfig
. Primitive types are defined at compile time if I am not mistaken.
credible_sets, study_index, locus_to_gene_threshold | ||
) | ||
.write.mode(session.write_mode) | ||
.json(evidence_output_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSON? Is the plan to add a step in the evidence parsers that pulls this dataset and validate it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The evidence parsers are not expected to touch this dataset, however the platform etl, at its current form, ingests evidence as json. We can make is smarter later, but let's do one step at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Last request: please reference this new step in docs/python_api/steps/l2g.md
✨ Context
This PR implements the following: