Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): adding new variant annotation model #641

Merged
merged 57 commits into from
Jun 30, 2024

Conversation

DSuveges
Copy link
Contributor

@DSuveges DSuveges commented Jun 12, 2024

✨ Context

Variant representation replaced from the GnomAD variant annotation to a schema tailored with the Platform variant representation in mind. The data for the new variant model is derived from VEP output. Essentially the idea is that we'll run VEP for every variant we have phenotypic data for every release for the most up to date Ensembl version.

🛠 What does this PR implement

New variant index

  • New data model.
  • New schema.
  • Relevant dataset specific methods migrated to new data model from old variant index and the old variant annotation.
  • To be consistent with the rest of the Platform design, instead of variant consequence terms, we are using sequence ontology identifiers. To do the mapping, a mapping file had to be added as a static asset.

Removal of variant annotation - A number of steps were required to eradicate variant annotation from gentropy. There's no need for it anymore, as the variant index data model will contain all required annotation.

  • Removing schema.
  • Removing dataset module.
  • Removing dataset documentation.
  • Walked through most of the code and documentation to remove references to variant annotation. This effort might not be complete.

Migrating variant annotation to variant index - As there are multiple dependencies on Variant Annotation dataset, these steps needed to be reviewed and amended to make sure no downstream process would fail.

  • GnomAD ingestion step now produces variant index dataset, which can be picked up by the variant index step to bring in cross references to GnomAD and allele frequencies.
  • V2G generation now depends on the new variant index.
  • The ingestion of GWAS Catalog curated dataset picks up the variant index generated from GnomAD variants.

New data source - Variant annotation is parsed from Ensembl's VEP output. Therefore we can consider Ensembl as a separate datasource, where potentially other parsers can be added (eg. parser for rs id to variant id).

  • New datasource added to gentropy with the logic to parse JSON formatted VEP output.
  • Documentation was added.
  • As the VEP output json can be very big but also can miss all columns where no data is available, it makes sense to load the data with providing the schema. Providing the schema ensures parsing methods to not fail. The schema was put to static assets.

Pipelining

  • All relevant configuration was updated to remove variant annotation and use the new variant index wherever it was necessary.
  • Step was added to generate variant index based on the Ensembl VEP output.
  • All configuration was removed related to variant annotation.
  • GWAS Catalog ingestion DAGs were also updated to use the GnomAD variant set.

Tests

  • Tests were updated to point to the new classes and new tests were written to test the VEP parser.
  • Airflow steps and configs were tested: GnomAD variant ingestion. Variant index generation, V2G generation.

Other refactoring along the way

  • Removal of unused sample dataset in the tests folder.
  • Updating doc strings when found inaccuracies.
  • I have added a line to VSCode config to suppress extension recommendations, as I found them overly annoying.
  • The vep_consequences.tsv file got extended with new consequence terms and changed the format slightly to no transformation is required when used.
  • As there's no variant annotation, I removed the step, but at the same time, I put together the two GnomAD preprocess step into one single DAG.

🙈 Missing

  • !! This PR doesn't include logic or workflow to generate the variant list for the index
  • !! The logic and workflow to generate VEP annotation for the variant is also in a separate PR.
  • !! The current place for VEP output (vep_output_path: gs://genetics_etl_python_playground/vep/full_variant_index_vcf) is not tidy. This needs to be refactored.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)? - not really.
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@github-actions github-actions bot added the Step label Jun 19, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jun 19, 2024
@DSuveges DSuveges marked this pull request as ready for review June 25, 2024 06:59
f.col("Term").alias("label"),
f.col("v2g_score").cast("double").alias("score"),
)
).withColumn("score", f.col("score").cast("double"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the score file is read, the schema not right. The scores need to be cast to double.

"type": "string",
"nullable": false,
"metadata": {}
"nullable": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"nullable": true,
"nullable": false,

Copy link
Contributor

@ireneisdoomed ireneisdoomed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments and let me know your thoughts

src/gentropy/assets/schemas/variant_index.json Outdated Show resolved Hide resolved
src/gentropy/assets/schemas/variant_index.json Outdated Show resolved Hide resolved
src/gentropy/assets/schemas/variant_index.json Outdated Show resolved Hide resolved
src/gentropy/assets/schemas/variant_index.json Outdated Show resolved Hide resolved
src/gentropy/datasource/ensembl/vep_parser.py Outdated Show resolved Hide resolved
src/gentropy/datasource/gnomad/variants.py Show resolved Hide resolved
src/gentropy/datasource/gwas_catalog/associations.py Outdated Show resolved Hide resolved
src/gentropy/gnomad_ingestion.py Show resolved Hide resolved
src/gentropy/variant_index.py Outdated Show resolved Hide resolved
@DSuveges
Copy link
Contributor Author

Thanks for the thorough review. I think I could answer most of your questions and implement the suggestions where I could.

Copy link
Contributor

@ireneisdoomed ireneisdoomed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing all my comments.
I'll approve, but please look at my comment about the joining strategy when bringing the frequencies to the index table.

@DSuveges DSuveges merged commit f79c789 into dev Jun 30, 2024
4 checks passed
@DSuveges DSuveges deleted the ds_3333_new_variant_index branch June 30, 2024 20:16
project-defiant pushed a commit that referenced this pull request Jul 12, 2024
* feat(variant annotation): new variant annotation schema + logic to extract from VEP

* fix: typehints in function

* refactor(variant annotation): migrating methods to the new schema

* chore: pre-commit auto fixes [...]

* refactor(variant index): sorting out new variant index dataset

* chore: pre-commit auto fixes [...]

* feature(vep): adding predictors to vep transcript object

* fix(schema): fixing schema missing fields

* fix(schema): fixing schema missing fields

* fix(schema): fixing schema missing fields

* fix(schema): fixing schema missing fields

* chore: pre-commit auto fixes [...]

* fix(annotation): array union under condition

* fix: merging dbxref objects

* feat(variants): updating variants to make more robust

* feat: migrating methods to new variant index

* adjusting variant index methods

* some updates

* rename v2g to variant to gene

* chore: pre-commit auto fixes [...]

* adding test

* chore: pre-commit auto fixes [...]

* fix(precommit): json file needed to rename to jsonl

* fix(precommit): removing steps depending on old data model

* fix(coftest): fixing variant index mock generation

* fix: typo in package import

* fix: sorting out conftest

* refactor(gwas ingest): Updating GnomAD handling

* refactor(gnomad): variant annotation removed, changed to variant index, steps updated

* refactor: shuffling around gnomad logic

* fix: references in tests

* refactor: sorting out gnomad variant dag

* refactor: cleaning configs and tests

* docs(vep): adding datasource description

* test(vep): adding more test to the vep parser

* test(vep): tests are now running

* fix: removing version suffix from pyproject and airflow config

* fix: reverting DAGs - removing temporary modifications I added for testing

* fix: addressing reviewer comments

* refactor: fiddling with variant index annotation logic

* chore: addressing comments

* fix: variant cross-ref snake case

* fix: correcting join strategy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants