
Add genetics ETL step to generate association data #3599

Closed
DSuveges opened this issue Oct 29, 2024 · 0 comments · Fixed by opentargets/gentropy#888 or opentargets/orchestration#64
Labels
Data Relates to Open Targets data team Genetics Relates to Open Targets genetics team

Comments

@DSuveges

The genetics ETL aggregates l2g predictions and study data to build disease/target evidence. The platform ETL picks up this dataset and integrates it with other evidence sources to build disease/target associations. The performance of l2g predictions is best evaluated on the association data, but producing that data currently requires a full ETL run, which makes l2g iteration very slow.

The aim of this issue is to add a step to gentropy, and subsequently add one more task to the ETL orchestration, to build the direct and indirect association datasets.

  • Direct associations: take the evidence, group by target/disease and apply a harmonic sum to the l2g scores.
  • Indirect associations: take the evidence, join it with the disease index, explode the parent terms, group by target/parent disease and apply a harmonic sum to the l2g scores.
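
The two aggregations above can be sketched in plain Python as follows. This is only an illustration, not the gentropy implementation: the field names (`targetId`, `diseaseId`, `score`), the `disease_ancestors` mapping standing in for the disease index, and the `include_self` flag are all hypothetical. The harmonic sum is assumed to be the unnormalised form in which the i-th highest score is divided by i squared; any normalisation or cap on the number of scores is left out.

```python
from collections import defaultdict


def harmonic_sum(scores):
    """Unnormalised harmonic sum: the i-th largest score is weighted by 1 / i**2.

    Assumption: this weighting is not spelled out in the issue.
    """
    ranked = sorted(scores, reverse=True)
    return sum(score / (rank ** 2) for rank, score in enumerate(ranked, start=1))


def direct_associations(evidence):
    """Group evidence by (target, disease) and harmonic-sum the l2g scores."""
    grouped = defaultdict(list)
    for row in evidence:
        grouped[(row["targetId"], row["diseaseId"])].append(row["score"])
    return {key: harmonic_sum(scores) for key, scores in grouped.items()}


def indirect_associations(evidence, disease_ancestors, include_self=True):
    """Propagate each evidence row to its parent terms, then aggregate.

    disease_ancestors: hypothetical dict mapping a disease ID to its parent
    term IDs (a stand-in for the join with the disease index).
    include_self: assumption; also keeps the evidence on the original disease.
    """
    grouped = defaultdict(list)
    for row in evidence:
        terms = set(disease_ancestors.get(row["diseaseId"], []))
        if include_self:
            terms.add(row["diseaseId"])
        # "Explode" step: one copy of the score per propagated disease term.
        for term in terms:
            grouped[(row["targetId"], term)].append(row["score"])
    return {key: harmonic_sum(scores) for key, scores in grouped.items()}
```

In a Spark-based step the same logic would map onto a group-by with a window or array aggregation; the pure-Python version is used here only to keep the sketch self-contained.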

These two datasets need to be saved as parquet files together with the other ETL outputs. Important: these datasets are not ingested by the platform ETL.
