feat(datasets): adding new variant annotation model #641
f.col("Term").alias("label"),
f.col("v2g_score").cast("double").alias("score"),
)
).withColumn("score", f.col("score").cast("double"))
When the score file is read, the schema is not right: the scores need to be cast to double.
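The issue raised here is not Spark-specific: values read from a delimited text file arrive as strings unless they are cast. The following is a minimal plain-Python sketch of the same idea, assuming a tab-separated file with the `Term` and `v2g_score` columns named in the diff; the sample rows and the `read_scores` helper are made up for illustration and are not part of the PR.

```python
import csv
import io

# Made-up sample mirroring the two columns named in the diff above.
RAW_TSV = "Term\tv2g_score\nmissense_variant\t0.66\nintron_variant\t0.1\n"


def read_scores(text: str) -> list[dict]:
    """Parse the TSV and cast the score column from str to float."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        # csv yields every value as str, so the cast has to be explicit.
        {"label": row["Term"], "score": float(row["v2g_score"])}
        for row in reader
    ]


rows = read_scores(RAW_TSV)
```

In the PR itself the equivalent step is the `.cast("double")` call on the `score` column shown in the diff.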
"type": "string",
"nullable": false,
"metadata": {}
"nullable": true,
"nullable": true,
"nullable": false,
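To make the effect of the suggested flag concrete, here is a small sketch of one way a `"nullable": false` field can be interpreted: any record with a missing or null value for that field counts as a violation. The field name `variantId`, the sample records, and the `null_violations` helper are hypothetical; only the JSON layout follows the schema snippet above.

```python
import json

# Hypothetical field definition using the JSON schema layout shown above.
FIELD_JSON = (
    '{"name": "variantId", "type": "string", "nullable": false, "metadata": {}}'
)


def null_violations(field: dict, records: list[dict]) -> list[int]:
    """Return indices of records holding None (or nothing) in a non-nullable field."""
    if field["nullable"]:
        return []
    return [i for i, rec in enumerate(records) if rec.get(field["name"]) is None]


field = json.loads(FIELD_JSON)
# Made-up records: one valid, one explicit null, one with the key missing.
records = [{"variantId": "1_100_A_T"}, {"variantId": None}, {}]
bad = null_violations(field, records)
```

With the flag flipped to `true`, the same records would produce no violations, which is the difference the suggestion is about.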
See my comments and let me know your thoughts
Thanks for the thorough review. I think I could answer most of your questions and implement the suggestions where I could.
Thank you for reviewing all my comments.
I'll approve, but please look at my comment about the joining strategy when bringing the frequencies to the index table.
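The thread does not spell out which join was at issue, so the following is only an illustration of why the choice matters when frequencies are attached to an index table: a left join keeps every index row, while an inner join silently drops variants that have no frequency entry. The field names, sample rows, and both helper functions are assumptions made for this sketch, not the gentropy implementation.

```python
# Made-up rows; field names are illustrative, not the gentropy schema.
index = [
    {"variantId": "1_100_A_T"},
    {"variantId": "2_200_G_C"},
]
frequencies = {
    "1_100_A_T": [{"populationName": "nfe", "alleleFrequency": 0.12}],
}


def left_join(index_rows, freq_by_id):
    """Keep every index row; attach frequencies where available, else None."""
    return [
        {**row, "alleleFrequencies": freq_by_id.get(row["variantId"])}
        for row in index_rows
    ]


def inner_join(index_rows, freq_by_id):
    """Keep only index rows that have a frequency entry."""
    return [
        {**row, "alleleFrequencies": freq_by_id[row["variantId"]]}
        for row in index_rows
        if row["variantId"] in freq_by_id
    ]


left = left_join(index, frequencies)
inner = inner_join(index, frequencies)
```

Here `left` still contains both variants (the second with no frequencies), whereas `inner` contains only the first.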
* feat(variant annotation): new variant annotation schema + logic to extract from VEP
* fix: typehints in function
* refactor(variant annotation): migrating methods to the new schema
* chore: pre-commit auto fixes [...]
* refactor(variant index): sorting out new variant index dataset
* chore: pre-commit auto fixes [...]
* feature(vep): adding predictors to vep transcript object
* fix(schema): fixing schema missing fields
* fix(schema): fixing schema missing fields
* fix(schema): fixing schema missing fields
* fix(schema): fixing schema missing fields
* chore: pre-commit auto fixes [...]
* fix(annotation): array union under condition
* fix: merging dbxref objects
* feat(variants): updating variants to make more robust
* feat: migrating methods to new variant index
* adjusting variant index methods
* some updates
* rename v2g to variant to gene
* chore: pre-commit auto fixes [...]
* adding test
* chore: pre-commit auto fixes [...]
* fix(precommit): json file needed to rename to jsonl
* fix(precommit): removing steps depending on old data model
* fix(coftest): fixing variant index mock generation
* fix: typo in package import
* fix: sorting out conftest
* refactor(gwas ingest): Updating GnomAD handling
* refactor(gnomad): variant annotation removed, changed to variant index, steps updated
* refactor: shuffling around gnomad logic
* fix: references in tests
* refactor: sorting out gnomad variant dag
* refactor: cleaning configs and tests
* docs(vep): adding datasource description
* test(vep): adding more test to the vep parser
* test(vep): tests are now running
* fix: removing version suffix from pyproject and airflow config
* fix: reverting DAGs - removing temporary modifications I added for testing
* fix: addressing reviewer comments
* refactor: fiddling with variant index annotation logic
* chore: addressing comments
* fix: variant cross-ref snake case
* fix: correcting join strategy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
✨ Context
The variant representation is changed from the GnomAD variant annotation to a schema tailored to the Platform's variant representation. The data for the new variant model is derived from VEP output. Essentially, the idea is that for every release we'll run VEP, using the most up-to-date Ensembl version, on every variant we have phenotypic data for.
🛠 What does this PR implement
New variant index
Removal of variant annotation - A number of steps were required to eradicate variant annotation from gentropy. There's no need for it anymore, as the variant index data model will contain all required annotation.
Migrating variant annotation to variant index - As there are multiple dependencies on the Variant Annotation dataset, these steps needed to be reviewed and amended to make sure no downstream process would fail.
New data source - Variant annotation is parsed from Ensembl's VEP output. Therefore we can consider Ensembl as a separate data source, where other parsers can potentially be added (e.g. a parser for rs id to variant id).
Pipelining
Tests
Other refactoring along the way
The vep_consequences.tsv file was extended with new consequence terms, and its format changed slightly so that no transformation is required when it is used.
🙈 Missing
The configuration (vep_output_path: gs://genetics_etl_python_playground/vep/full_variant_index_vcf) is not tidy. This needs to be refactored.
🚦 Before submitting
[...] dev branch?
[...] (make test)?
[...] (poetry run pre-commit run --all-files)?