Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: step to export disease/target evidence #867

Merged
merged 14 commits into from
Oct 24, 2024
Merged

feat: step to export disease/target evidence #867

merged 14 commits into from
Oct 24, 2024

Conversation

DSuveges
Copy link
Contributor

✨ Context

This PR implements the following:

  • Evidence export method added to l2g dataset.
  • Step to export evidence.
  • Configuration for the step.

@DSuveges DSuveges changed the title feature(l2g): step to export disease/target evidence feat(l2g): step to export disease/target evidence Oct 22, 2024
@DSuveges DSuveges changed the title feat(l2g): step to export disease/target evidence feat: step to export disease/target evidence Oct 22, 2024
@DSuveges DSuveges linked an issue Oct 22, 2024 that may be closed by this pull request
Copy link
Contributor

@ireneisdoomed ireneisdoomed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Daniel!
I have some questions, please have a look

class LocusToGeneEvidenceStepConfig(StepConfig):
"""Configuration of the locus to gene evidence step."""

session: Any = field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to specify the session in the config? Hail is set to false by default

study_locus: StudyLocus,
study_index: StudyIndex,
l2g_threshold: float = 0.05,
) -> DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Virtually this is another source of evidence generation, and it is not going to go through validation unlike the others.
Maybe it's annoying, but I think it'd be nice to include a new data model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this... and not convinced. Evidence doesn't have any downstream application, no methods are called on evidence dataset... Especially considering we only have five or so columns, of which two are constant. I'm quite inclined not to have an other dataset. Later maybe... eg. in case if the generation of the VEP input requires it.


return (
self.df.filter(f.col("score") >= l2g_threshold)
.join(study_locus.df, on="studyLocusId", how="inner")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the join lighter

Suggested change
.join(study_locus.df, on="studyLocusId", how="inner")
.join(study_locus.df.select("studyLocusId", "studyId"), on="studyLocusId", how="inner")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect spark optimizes this in the background, but it's better to be more explicit.

return (
self.df.filter(f.col("score") >= l2g_threshold)
.join(study_locus.df, on="studyLocusId", how="inner")
.join(study_index.df.drop("geneId"), on="studyId", how="inner")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the join lighter

Suggested change
.join(study_index.df.drop("geneId"), on="studyId", how="inner")
.join(study_index.df.select("studyId", "diseaseIds"), on="studyId", how="inner")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same.

f.col("geneId").alias("targetFromSourceId"),
f.explode(f.col("diseaseIds")).alias("diseaseFromSourceMappedId"),
f.col("score").alias("resourceScore"),
f.col("studyLocusId").alias("studyLocusId"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant?

Suggested change
f.col("studyLocusId").alias("studyLocusId"),
f.col("studyLocusId"),

credible_set_path: str,
study_index_path: str,
evidence_output_path: str,
locus_to_gene_threshold: float = LocusToGeneEvidenceStepConfig().locus_to_gene_threshold,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i know now why you are specifying the session to the config.I had this scenario before. And if you do LocusToGeneEvidenceStepConfig(), you are going to trigger Spark to get or create (in this case get) the session, which is prob not desirable. Because of that, I suggest not to get the default from the config

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you agree that the action here is not to pull the parameter from the config and remove the session reference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case of the float you could not evaluate the LocusToGeneEvidenceStepConfig. Primitive types are defined at compile time if I am not mistaken.

credible_sets, study_index, locus_to_gene_threshold
)
.write.mode(session.write_mode)
.json(evidence_output_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON? Is the plan to add a step in the evidence parsers that pulls this dataset and validate it there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The evidence parsers are not expected to touch this dataset, however the platform etl, at its current form, ingests evidence as json. We can make is smarter later, but let's do one step at a time.

Copy link
Contributor

@ireneisdoomed ireneisdoomed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Last request: please reference this new step in docs/python_api/steps/l2g.md

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 24, 2024
@DSuveges DSuveges merged commit d4b91d6 into dev Oct 24, 2024
5 checks passed
@DSuveges DSuveges deleted the ds_evidence_step branch October 24, 2024 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dataset documentation Improvements or additions to documentation Feature size-M Step
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generate gentropy-based L2G evidence
3 participants