feat(StudyIndex): validation for study type, disease, target etc #672

DSuveges · 2024-07-05T13:31:23Z

✨ Context

Logic to validate study index.

🛠 What does this PR implement

The following validation steps are implemented:

StudyIndex.validate_disease - validates diseases against the provided disease index (a view on the disease index).
StudyIndex.validate_study_type - flags studies that are neither GWAS nor some kind of QTL.
StudyIndex.validate_target - flags QTL studies that don't have a valid Ensembl gene ID in the target index.
StudyIndex.validate_unique_study_id - flags studies with non-unique study identifiers.
Tests for all methods.
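
As a rough illustration of two of these checks, here is a plain-Python sketch that flags unexpected study types and non-unique study identifiers. It is not the Spark implementation from this PR: rows are modelled as dictionaries, the function names flag_study_type and flag_duplicate_ids are hypothetical, the "ends with qtl" rule is an assumption, and the study-type flag text is a placeholder. Only the unique-ID flag text is taken from the output quoted later in this thread.

```python
from collections import Counter

# The second label appears in the flag distribution quoted in this thread;
# the first one is a made-up placeholder.
INVALID_TYPE_FLAG = "Study type is not supported."  # hypothetical label
DUPLICATED_ID_FLAG = "The identifier of this study is not unique."


def flag_study_type(studies):
    """Append a flag to studies that are neither gwas nor some kind of qtl."""
    for study in studies:
        study_type = (study.get("studyType") or "").lower()
        # Assumption: every QTL study type ends with "qtl" (eqtl, pqtl, ...).
        if not (study_type == "gwas" or study_type.endswith("qtl")):
            study["qualityControls"].append(INVALID_TYPE_FLAG)
    return studies


def flag_duplicate_ids(studies):
    """Append a flag to every study whose identifier occurs more than once."""
    counts = Counter(study["studyId"] for study in studies)
    for study in studies:
        if counts[study["studyId"]] > 1:
            study["qualityControls"].append(DUPLICATED_ID_FLAG)
    return studies


studies = [
    {"studyId": "S1", "studyType": "gwas", "qualityControls": []},
    {"studyId": "S2", "studyType": "eqtl", "qualityControls": []},
    {"studyId": "S2", "studyType": "pqtl", "qualityControls": []},
    {"studyId": "S3", "studyType": "unknown", "qualityControls": []},
]
flagged = flag_duplicate_ids(flag_study_type(studies))
```

Note that both copies of a duplicated identifier are flagged, since there is no way to tell from the identifier alone which one is the intended record.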

How it works:

qcd_study_index = (
    study_index
    .validate_disease(disease_map)
    .validate_target(target_index)
    .validate_study_type()
    .validate_unique_study_id()
)

Failing any validation step adds a corresponding flag to the qualityControls column.

Heads up! To enable the use of the existing flagging instruments from the StudyLocus class, the necessary function (update_quality_flag) has been moved to the Dataset class, so all of our datasets are QC-able in the same way.
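
A minimal sketch of why putting the helper on the base class helps, written in plain Python rather than Spark. Everything here is a simplified assumption: the real update_quality_flag works on Spark columns, while this toy version works on lists. The flag_rows helper, the validate_has_variant check and its flag text are hypothetical and exist only to show two subclasses sharing the same flagging code.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Toy stand-in for a dataset: rows are plain dictionaries."""

    rows: list = field(default_factory=list)

    @staticmethod
    def update_quality_flag(qc, flag_condition, flag_text):
        """Return a copy of the flag list, with flag_text added if the condition holds."""
        if flag_condition and flag_text not in qc:
            return [*qc, flag_text]
        return list(qc)

    def flag_rows(self, condition, flag_text):
        """Hypothetical helper: apply one check to every row, return a new dataset."""
        return type(self)(
            [
                {
                    **row,
                    "qualityControls": self.update_quality_flag(
                        row["qualityControls"], condition(row), flag_text
                    ),
                }
                for row in self.rows
            ]
        )


class StudyIndex(Dataset):
    def validate_target(self, known_gene_ids):
        # Only QTL studies are checked (assumption: their type ends with "qtl").
        return self.flag_rows(
            lambda row: row["studyType"].endswith("qtl")
            and row.get("geneId") not in known_gene_ids,
            "Target/gene identifier could not match to reference.",
        )


class StudyLocus(Dataset):
    def validate_has_variant(self):  # hypothetical check, for illustration only
        return self.flag_rows(
            lambda row: row.get("variantId") is None,
            "Missing variant identifier.",  # hypothetical flag text
        )


studies = StudyIndex(
    [
        {"studyType": "eqtl", "geneId": "ENSG_UNKNOWN", "qualityControls": []},
        {"studyType": "gwas", "geneId": None, "qualityControls": []},
    ]
)
loci = StudyLocus([{"variantId": None, "qualityControls": ["Existing flag"]}])

flagged_studies = studies.validate_target({"ENSG00000000001"})
flagged_loci = loci.validate_has_variant()
```

Because each check returns a new dataset of the same type, checks can be chained in the same way as the validate_* calls shown above.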

🙈 Missing

In this PR the logic IS NOT organised into a step. There's no orchestration, just the business logic.
Gentropy has no disease index dataset, so there's no point in ingesting the disease index as it is. In the current implementation, the .validate_disease() method therefore expects a dataframe with all the current and obsolete EFOs. Once we have a disease dataset class, this assumption can be changed.
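
To make the expected input more concrete, here is a plain-Python sketch of validating a study against a lookup that holds both current and obsolete EFOs. It is an illustration under assumptions, not the method from this PR: the lookup is a dictionary instead of a dataframe, the field names traitFromSourceMappedIds and diseaseIds are assumed, the function name resolve_diseases is hypothetical, and the sketch ignores the study type. The flag text is the one shown in the flag distribution in this thread.

```python
NO_DISEASE_FLAG = "No valid disease identifier found."


def resolve_diseases(study, efo_to_disease):
    """Map the study's EFOs to current disease IDs and flag it if none resolve.

    efo_to_disease maps every known EFO (current or obsolete) to the current ID.
    """
    resolved = sorted(
        {
            efo_to_disease[efo]
            for efo in study["traitFromSourceMappedIds"]
            if efo in efo_to_disease
        }
    )
    quality_controls = list(study["qualityControls"])
    if not resolved:
        quality_controls.append(NO_DISEASE_FLAG)
    return {**study, "diseaseIds": resolved, "qualityControls": quality_controls}


# Toy lookup: EFO_OLD is an obsolete term that now points to EFO_NEW.
efo_to_disease = {"EFO_NEW": "EFO_NEW", "EFO_OLD": "EFO_NEW"}

ok_study = resolve_diseases(
    {"studyId": "S1", "traitFromSourceMappedIds": ["EFO_OLD"], "qualityControls": []},
    efo_to_disease,
)
bad_study = resolve_diseases(
    {"studyId": "S2", "traitFromSourceMappedIds": ["EFO_MISSING"], "qualityControls": []},
    efo_to_disease,
)
```

Including obsolete terms in the lookup means a study annotated with a retired EFO is normalised to the current identifier instead of being flagged.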

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

DSuveges · 2024-07-05T14:04:30Z

Validation at work:

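# Import paths assumed from the repository layout at the time of this PR.
from pyspark.sql import functions as f

from gentropy.dataset.gene_index import GeneIndex
from gentropy.dataset.study_index import StudyIndex

# `session` is an existing gentropy Session object; its creation is not shown here.

# Study index:
studies = (
    StudyIndex.from_parquet(session, "/Users/dsuveges/project_data/gentropy/study_index", recursiveFileLookup=True)
)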

# Gene index:
gene_index = (
    GeneIndex.from_parquet(session, "/Users/dsuveges/project_data/gentropy/gene_index")
)

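# Disease index: map every EFO identifier (current or obsolete) to its current disease ID.
disease_map = (
    session.spark.read.parquet('/Users/dsuveges/project_data/gentropy/diseases')
    .select(
        f.col('id').alias('diseaseId'),
        # One row per identifier: the current ID plus each obsolete term.
        # Without obsolete terms the expression is null and explode_outer yields a single null row.
        f.explode_outer(
            f.when(
                f.col('obsoleteTerms').isNotNull(),
                f.array_union(
                    f.array('id'),
                    f.col('obsoleteTerms')
                )
            )
        ).alias('efo')
    )
    # Diseases without obsolete terms map to themselves.
    .withColumn(
        'efo',
        f.coalesce(f.col('efo'), f.col('diseaseId'))
    )
)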


validated_studies = (
    studies
    .validate_disease(disease_map)
    .validate_target(gene_index)
    .validate_study_type()
    .validate_unique_study_id()
    .persist()
)
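
For readers less familiar with explode_outer and coalesce, the disease_map construction above roughly corresponds to the following plain-Python sketch. It is only an approximation for illustration: the function name build_disease_map is hypothetical, the input is a list of dictionaries rather than a dataframe, and null elements inside obsoleteTerms are not handled.

```python
def build_disease_map(diseases):
    """Return (diseaseId, efo) pairs: one per current ID and one per obsolete term."""
    pairs = []
    for disease in diseases:
        disease_id = disease["id"]
        obsolete_terms = disease.get("obsoleteTerms")
        if obsolete_terms is None:
            # Mirrors explode_outer on a null array followed by coalesce:
            # the disease maps to itself.
            pairs.append((disease_id, disease_id))
            continue
        # Mirrors array_union(array(id), obsoleteTerms): duplicates are dropped.
        terms = []
        for term in [disease_id, *obsolete_terms]:
            if term not in terms:
                terms.append(term)
        pairs.extend((disease_id, term) for term in terms)
    return pairs


# Toy input, not real EFO identifiers.
diseases = [
    {"id": "EFO_1", "obsoleteTerms": ["EFO_9", "EFO_8"]},
    {"id": "EFO_2", "obsoleteTerms": None},
    {"id": "EFO_3", "obsoleteTerms": []},
]
disease_pairs = build_disease_map(diseases)
```

Every current identifier appears as its own efo, so studies that already use current EFOs match the map as well as those using obsolete ones.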

Of 1,975,874 studies, 14,871 are flagged:

+--------------+-----+
|     projectId|count|
+--------------+-----+
|  Nedelec_2016|   40|
|        OneK1K|   21|
|   Alasoo_2018|   30|
|          GTEx| 1317|
|   FINNGEN_R10| 4816|
|          GCST| 7670|
|   Nathan_2022|   38|
|        FUSION|   69|
|Schmiedel_2018|  119|
|    Cytoimmgen|   39|
|       GENCORD|   25|
|     BLUEPRINT|   82|
|      GEUVADIS|   22|
|    Lepik_2017|   26|
|    Quach_2016|  104|
|  Fairfax_2014|   15|
|        ROSMAP|   56|
|        HipSci|   40|
|      BrainSeq|   53|
|       TwinsUK|   83|
+--------------+-----+

Distribution of quality flags:

+----------------------------------------------------+-----+
|qualityControl                                      |count|
+----------------------------------------------------+-----+
|Failed summary statistics quality control           |472  |
|Target/gene identifier could not match to reference.|2385 |
|No valid disease identifier found.                  |11962|
|The identifier of this study is not unique.         |4838 |
|Non-additive model                                  |32   |
+----------------------------------------------------+-----+

As the labels show, flags that were already present in the study indices before validation are carried over.
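
The flag counts in the table add up to 19,689, which is more than the 14,871 flagged studies; that would be consistent with some studies carrying more than one flag. A plain-Python sketch of how the two numbers relate, using toy rows rather than the real data:

```python
from collections import Counter

# Toy study index: each study carries a list of quality flags.
studies = [
    {"studyId": "A", "qualityControls": []},
    {"studyId": "B", "qualityControls": ["No valid disease identifier found."]},
    {
        "studyId": "C",
        "qualityControls": [
            "No valid disease identifier found.",
            "The identifier of this study is not unique.",
        ],
    },
]

# A study counts as flagged if it has at least one flag.
flagged_studies = [study for study in studies if study["qualityControls"]]

# The distribution counts every flag separately, so a study with two flags
# contributes twice.
flag_counts = Counter(
    flag for study in studies for flag in study["qualityControls"]
)
```

In this toy example two studies are flagged, while the flag distribution sums to three.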

src/gentropy/datasource/gwas_catalog/study_index.py

src/gentropy/assets/schemas/study_index.json

src/gentropy/dataset/dataset.py

src/gentropy/dataset/study_index.py

d0choa · 2024-07-11T08:33:00Z

src/gentropy/dataset/study_index.py

+            disease_map (DataFrame): Reference dataframe with diseases
+
+        Returns:
+            DataFrame: where the disease column name will contain the


The string in the Returns: section is truncated. You could also add a slightly more verbose explanation of the method, because the name _normalise_disease doesn't say much about what it is for.

I'm adding more context.

d0choa

Consistent, nice and tidy. Great!

* chore: snapshot
* feat(StudyIndex): adding valiation methods
* feat(studyIdex): adding disease validation
* fix: typo in test
* fix: moving import under the type checking condition
* fix: some columns might need to be dropped
* fix(study index): preventing [null] arrays in the cohorts object
* fix(study index): more context is provided for disease normalisation

DSuveges added 6 commits July 4, 2024 14:51

chore: snapshot

c168362

Merge branch 'dev' of https://github.com/opentargets/gentropy into ds…

40659b1

…_3359_study_index_validation

feat(StudyIndex): adding valiation methods

5e8582b

feat(studyIdex): adding disease validation

8d8a4d9

fix: typo in test

c4eead0

Merge branch 'dev' of https://github.com/opentargets/gentropy into ds…

3ec3292

…_3359_study_index_validation

github-actions bot added size-L Dataset labels Jul 5, 2024

DSuveges changed the title ~~Ds 3359 study index validation~~ feature(StudyIndex): validation for study type, disease, target etc Jul 5, 2024

DSuveges changed the title ~~feature(StudyIndex): validation for study type, disease, target etc~~ feat(StudyIndex): validation for study type, disease, target etc Jul 5, 2024

fix: moving import under the type checking condition

e3cbec5

github-actions bot added the Feature label Jul 5, 2024

DSuveges requested a review from d0choa July 5, 2024 13:41

DSuveges linked an issue Jul 5, 2024 that may be closed by this pull request

Study index gentropy ETL step opentargets/issues#3359

Closed

DSuveges added 3 commits July 5, 2024 18:24

fix: some columns might need to be dropped

c4b633d

Merge branch 'dev' into ds_3359_study_index_validation

1ff81ee

fix(study index): preventing [null] arrays in the cohorts object

48cc033

github-actions bot added the Datasource label Jul 9, 2024

Merge branch 'dev' into ds_3359_study_index_validation

f3f6a66

DSuveges commented Jul 9, 2024

src/gentropy/datasource/gwas_catalog/study_index.py

Merge branch 'dev' into ds_3359_study_index_validation

d1bc6ae

d0choa reviewed Jul 11, 2024

src/gentropy/assets/schemas/study_index.json

d0choa reviewed Jul 11, 2024

src/gentropy/dataset/dataset.py

d0choa reviewed Jul 11, 2024

src/gentropy/dataset/study_index.py

d0choa reviewed Jul 11, 2024

d0choa approved these changes Jul 11, 2024

DSuveges added 2 commits July 11, 2024 11:12

Merge branch 'dev' into ds_3359_study_index_validation

452bb46

fix(study index): more context is provided for disease normalisation

31ba4aa

DSuveges merged commit 91817fd into dev Jul 11, 2024
4 checks passed

DSuveges deleted the ds_3359_study_index_validation branch July 11, 2024 10:29
