feat(datasets): adding new variant annotation model #641

DSuveges · 2024-06-12T18:51:53Z

✨ Context

The variant representation is moved from the GnomAD variant annotation to a schema designed with the Platform variant representation in mind. The data for the new variant model is derived from VEP output. The idea is that, for every release, we run VEP on every variant we have phenotypic data for, using the most up-to-date Ensembl version.

🛠 What does this PR implement

New variant index

New data model.
New schema.
Relevant dataset-specific methods migrated from the old variant index and the old variant annotation to the new data model.
To be consistent with the rest of the Platform design, we use sequence ontology identifiers instead of variant consequence terms. A mapping file was added as a static asset to do the mapping.
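
The term-to-identifier mapping described above can be sketched in plain Python. This is only an illustration: the column names (`id`, `label`), the inline TSV excerpt and both helper functions are assumptions, and the identifier format in the actual static asset may differ.

```python
import csv
import io
from typing import Dict, List, Optional

# Hypothetical excerpt of a mapping file. Column names are assumptions;
# the Sequence Ontology accessions themselves are the standard ones.
MAPPING_TSV = (
    "id\tlabel\n"
    "SO:0001583\tmissense_variant\n"
    "SO:0001587\tstop_gained\n"
    "SO:0001627\tintron_variant\n"
    "SO:0001819\tsynonymous_variant\n"
)


def load_term_to_so_id(tsv_text: str) -> Dict[str, str]:
    """Build a lookup from consequence term to sequence ontology identifier."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["label"]: row["id"] for row in reader}


def to_so_ids(terms: List[str], mapping: Dict[str, str]) -> List[Optional[str]]:
    """Translate consequence terms; unmapped terms become None instead of failing."""
    return [mapping.get(term) for term in terms]


term_to_so_id = load_term_to_so_id(MAPPING_TSV)
translated = to_so_ids(["missense_variant", "intron_variant"], term_to_so_id)
```

Returning `None` for unmapped terms is a design choice of this sketch; it makes gaps in the mapping file visible downstream rather than raising during parsing.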

Removal of variant annotation - A number of steps were required to eradicate variant annotation from gentropy. There's no need for it anymore, as the variant index data model will contain all required annotation.

Removing schema.
Removing dataset module.
Removing dataset documentation.
Walked through most of the code and documentation to remove references to variant annotation. This effort might not be complete.

Migrating variant annotation to variant index - As there are multiple dependencies on the Variant Annotation dataset, these steps needed to be reviewed and amended to make sure no downstream process would fail.

The GnomAD ingestion step now produces a variant index dataset, which can be picked up by the variant index step to bring in cross-references to GnomAD and allele frequencies.
V2G generation now depends on the new variant index.
The ingestion of the GWAS Catalog curated dataset picks up the variant index generated from GnomAD variants.
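
How the GnomAD-derived dataset could enrich the variant index can be sketched as a left join on the variant identifier. This is a plain-Python illustration under assumptions: the field names (`variantId`, `alleleFrequencies`, `rsIds`), the sample rows and the function are hypothetical, and the join strategy actually used in gentropy may differ.

```python
from typing import Any, Dict, List


def annotate_with_gnomad(
    variant_index: List[Dict[str, Any]], gnomad: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Left join: every indexed variant is kept, GnomAD fields are added when found."""
    gnomad_by_id = {row["variantId"]: row for row in gnomad}
    annotated = []
    for variant in variant_index:
        match = gnomad_by_id.get(variant["variantId"], {})
        annotated.append(
            {
                **variant,
                # Missing GnomAD entries yield None rather than dropping the variant.
                "alleleFrequencies": match.get("alleleFrequencies"),
                "rsIds": match.get("rsIds"),
            }
        )
    return annotated


# Made-up rows for illustration only.
index_rows = [{"variantId": "1_100_A_G"}, {"variantId": "2_200_C_T"}]
gnomad_rows = [
    {
        "variantId": "1_100_A_G",
        "alleleFrequencies": [{"populationName": "nfe", "alleleFrequency": 0.12}],
        "rsIds": ["rs0000001"],
    }
]
annotated_rows = annotate_with_gnomad(index_rows, gnomad_rows)
```

With a left join, variants absent from GnomAD stay in the index with empty annotation; an inner join would silently drop them.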

New data source - Variant annotation is parsed from Ensembl's VEP output. Therefore, we can treat Ensembl as a separate data source, to which other parsers could be added later (e.g. a parser mapping rs IDs to variant IDs).

New datasource added to gentropy with the logic to parse JSON formatted VEP output.
Documentation was added.
As the VEP output JSON can be very large and omits every column for which no data is available, the data is loaded with an explicit schema. Providing the schema ensures that the parsing methods do not fail on missing columns. The schema was added to the static assets.
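
Why an explicit schema helps can be shown with a small plain-Python sketch. It is not the gentropy parser: the reduced field list and the function are hypothetical, and the real step reads the data with Spark using the schema from the static assets.

```python
import json
from typing import Any, Dict, Iterable, List

# Hypothetical, heavily reduced schema; the real one has many more fields.
SCHEMA_FIELDS = ["id", "most_severe_consequence", "transcript_consequences"]


def parse_vep_lines(
    lines: Iterable[str], fields: List[str] = SCHEMA_FIELDS
) -> List[Dict[str, Any]]:
    """Parse JSON lines so that every record exposes every schema field."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        raw = json.loads(line)
        # Fields missing from the input become None instead of being absent.
        records.append({field: raw.get(field) for field in fields})
    return records


# Made-up input: each record lacks a different field.
sample_lines = [
    '{"id": "1_100_A_G", "most_severe_consequence": "intron_variant"}',
    '{"id": "2_200_C_T", "transcript_consequences": [{"gene_id": "ENSG_EXAMPLE"}]}',
]
parsed = parse_vep_lines(sample_lines)
```

Because every record carries the same keys, downstream code can access any field without first checking whether it appeared in the input. In Spark, passing a schema to the JSON reader has the same effect, with absent fields read as nulls.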

Pipelining

All relevant configuration was updated to remove variant annotation and use the new variant index wherever it was necessary.
Step was added to generate variant index based on the Ensembl VEP output.
All configuration was removed related to variant annotation.
GWAS Catalog ingestion DAGs were also updated to use the GnomAD variant set.

Tests

Tests were updated to point to the new classes and new tests were written to test the VEP parser.
Airflow steps and configs were tested: GnomAD variant ingestion, variant index generation, V2G generation.

Other refactoring along the way

Removal of unused sample dataset in the tests folder.
Updating docstrings where inaccuracies were found.
I have added a line to VSCode config to suppress extension recommendations, as I found them overly annoying.
The vep_consequences.tsv file was extended with new consequence terms, and its format was changed slightly so that no transformation is required when it is used.
As there is no variant annotation anymore, I removed the step and, at the same time, merged the two GnomAD preprocessing steps into a single DAG.

🙈 Missing

!! This PR doesn't include logic or workflow to generate the variant list for the index
!! The logic and workflow to generate VEP annotation for the variants are also in a separate PR.
!! The current place for VEP output (vep_output_path: gs://genetics_etl_python_playground/vep/full_variant_index_vcf) is not tidy. This needs to be refactored.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)? - not really.
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?


DSuveges · 2024-06-25T08:58:22Z

src/gentropy/variant_to_gene.py

-            f.col("Term").alias("label"),
-            f.col("v2g_score").cast("double").alias("score"),
-        )
+        ).withColumn("score", f.col("score").cast("double"))


When the score file is read, the schema is not right, so the scores need to be cast to double.
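
A small plain-Python illustration of why the cast matters, with made-up values and a hypothetical helper (the actual fix is the Spark `cast("double")` in the diff above): while the scores are still strings, comparisons are lexicographic and can pick the wrong row.

```python
from typing import Any, Dict, List


def cast_scores(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with the score converted from string to float."""
    return [{**row, "score": float(row["score"])} for row in rows]


# Made-up rows, as if the score column had been read as strings.
raw_rows = [
    {"label": "missense_variant", "score": "0.5"},
    {"label": "intron_variant", "score": "1e-3"},
]

# String comparison: "1e-3" sorts after "0.5" because "1" > "0".
top_as_string = max(raw_rows, key=lambda row: row["score"])["label"]

# Numeric comparison after the cast: 0.5 > 0.001.
top_as_double = max(cast_scores(raw_rows), key=lambda row: row["score"])["label"]
```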

ireneisdoomed · 2024-06-26T10:54:24Z

src/gentropy/assets/schemas/variant_index.json

-      "type": "string",
-      "nullable": false,
-      "metadata": {}
+      "nullable": true,


Suggested change

-      "nullable": true,
+      "nullable": false,

ireneisdoomed

See my comments and let me know your thoughts

src/gentropy/assets/schemas/variant_index.json

src/gentropy/datasource/ensembl/vep_parser.py

src/gentropy/datasource/gnomad/variants.py

src/gentropy/datasource/gwas_catalog/associations.py

src/gentropy/gnomad_ingestion.py

src/gentropy/variant_index.py

DSuveges · 2024-06-28T00:57:50Z

Thanks for the thorough review. I think I could answer most of your questions and implement the suggestions where I could.


ireneisdoomed

Thank you for reviewing all my comments.
I'll approve, but please look at my comment about the joining strategy when bringing the frequencies to the index table.


* feat(variant annotation): new variant annotation schema + logic to extract from VEP
* fix: typehints in function
* refactor(variant annotation): migrating methods to the new schema
* chore: pre-commit auto fixes [...]
* refactor(variant index): sorting out new variant index dataset
* chore: pre-commit auto fixes [...]
* feature(vep): adding predictors to vep transcript object
* fix(schema): fixing schema missing fields
* fix(schema): fixing schema missing fields
* fix(schema): fixing schema missing fields
* fix(schema): fixing schema missing fields
* chore: pre-commit auto fixes [...]
* fix(annotation): array union under condition
* fix: merging dbxref objects
* feat(variants): updating variants to make more robust
* feat: migrating methods to new variant index
* adjusting variant index methods
* some updates
* rename v2g to variant to gene
* chore: pre-commit auto fixes [...]
* adding test
* chore: pre-commit auto fixes [...]
* fix(precommit): json file needed to rename to jsonl
* fix(precommit): removing steps depending on old data model
* fix(coftest): fixing variant index mock generation
* fix: typo in package import
* fix: sorting out conftest
* refactor(gwas ingest): Updating GnomAD handling
* refactor(gnomad): variant annotation removed, changed to variant index, steps updated
* refactor: shuffling around gnomad logic
* fix: references in tests
* refactor: sorting out gnomad variant dag
* refactor: cleaning configs and tests
* docs(vep): adding datasource description
* test(vep): adding more test to the vep parser
* test(vep): tests are now running
* fix: removing version suffix from pyproject and airflow config
* fix: reverting DAGs - removing temporary modifications I added for testing
* fix: addressing reviewer comments
* refactor: fiddling with variant index annotation logic
* chore: addressing comments
* fix: variant cross-ref snake case
* fix: correcting join strategy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

DSuveges added 2 commits June 12, 2024 19:49

feat(variant annotation): new variant annotation schema + logic to extract from VEP

76cb983

fix: typehints in function

aa963d2

github-actions bot added Dataset Feature size-XL labels Jun 12, 2024

DSuveges and others added 4 commits June 14, 2024 11:12

refactor(variant annotation): migrating methods to the new schema

bbb18af

chore: pre-commit auto fixes [...]

4bfa2d4

refactor(variant index): sorting out new variant index dataset

7e6572d

Merge branch 'ds_3333_new_variant_index' of https://github.com/opentargets/gentropy into ds_3333_new_variant_index

053483e

github-actions bot added the Datasource label Jun 14, 2024

pre-commit-ci bot and others added 14 commits June 14, 2024 13:58

chore: pre-commit auto fixes [...]

ea152df

feature(vep): adding predictors to vep transcript object

5c70a90

Merge branch 'ds_3333_new_variant_index' of https://github.com/opentargets/gentropy into ds_3333_new_variant_index

73103bc

fix(schema): fixing schema missing fields

0e112ef

fix(schema): fixing schema missing fields

d92679b

fix(schema): fixing schema missing fields

bf975c6

fix(schema): fixing schema missing fields

b95cc09

chore: pre-commit auto fixes [...]

5b3c58c

fix(annotation): array union under condition

3690359

fix: resolving merge conflicts

6f48f7b

fix: merging dbxref objects

5e9e6fa

feat(variants): updating variants to make more robust

8225864

feat: migrating methods to new variant index

73ebc86

adjusting variant index methods

6a4f301

github-actions bot added the Step label Jun 19, 2024

DSuveges added 2 commits June 19, 2024 16:00

some updates

052446f

rename v2g to variant to gene

77eef57

github-actions bot added the documentation Improvements or additions to documentation label Jun 19, 2024

pre-commit-ci bot and others added 2 commits June 19, 2024 15:02

chore: pre-commit auto fixes [...]

1e53432

adding test

213e7d3

Merge branch 'dev' into ds_3333_new_variant_index

d8b8280

DSuveges marked this pull request as ready for review June 25, 2024 06:59

DSuveges added 3 commits June 25, 2024 09:26

fix: removing version suffix from pyproject and airflow config

caab094

Merge branch 'ds_3333_new_variant_index' of https://github.com/opentargets/gentropy into ds_3333_new_variant_index

5efa2b2

fix: reverting DAGs - removing temporary modifications I added for testing

d3a2016

DSuveges commented Jun 25, 2024


DSuveges requested review from ireneisdoomed and d0choa June 25, 2024 08:59

ireneisdoomed reviewed Jun 26, 2024


Merge branch 'dev' into ds_3333_new_variant_index

841a83d

ireneisdoomed reviewed Jun 26, 2024


DSuveges added 3 commits June 27, 2024 09:41

Merge branch 'dev' into ds_3333_new_variant_index

a5a016b

fix: addressing reviewer comments

0339c25

refactor: fiddling with variant index annotation logic

d62e784

chore: addressing comments

f24062f

DSuveges requested a review from ireneisdoomed June 28, 2024 01:00

DSuveges added 4 commits June 28, 2024 13:42

Merge branch 'dev' into ds_3333_new_variant_index

6c84d1e

fix: variant cross-ref snake case

bdf38ae

Merge branch 'dev' of https://github.com/opentargets/gentropy into ds_3333_new_variant_index

561a928

Merge branch 'ds_3333_new_variant_index' of https://github.com/opentargets/gentropy into ds_3333_new_variant_index

0c5d0cc

ireneisdoomed approved these changes Jun 28, 2024


DSuveges added 3 commits June 28, 2024 17:38

Merge branch 'dev' into ds_3333_new_variant_index

edf5536

fix: correcting join strategy

4899fdf

Merge branch 'ds_3333_new_variant_index' of https://github.com/opentargets/gentropy into ds_3333_new_variant_index

407eec6

DSuveges merged commit f79c789 into dev Jun 30, 2024
4 checks passed

DSuveges deleted the ds_3333_new_variant_index branch June 30, 2024 20:16

ireneisdoomed mentioned this pull request Jul 9, 2024

feat: full orchestration of the variant index dag #678

Merged

9 tasks

project-defiant mentioned this pull request Jul 29, 2024

fix: change config params to match new name #721

Merged

9 tasks
