feat(StudyIndex): validation for study type, disease, target etc #672

Merged
DSuveges merged 14 commits into dev from ds_3359_study_index_validation on Jul 11, 2024

@DSuveges DSuveges commented Jul 5, 2024

✨ Context

Logic to validate the study index.

🛠 What does this PR implement

The following validation steps are implemented:

  • StudyIndex.validate_disease - validates study diseases against the provided disease map (a view on the disease index).
  • StudyIndex.validate_study_type - flags studies that are neither GWAS nor some kind of QTL study.
  • StudyIndex.validate_target - flags QTL studies that don't have a valid Ensembl gene ID in the target index.
  • StudyIndex.validate_unique_study_id - flags studies with non-unique study identifiers.
  • Tests for all methods.

How it works:

qcd_study_index = (
    study_index
    .validate_disease(disease_map)
    .validate_target(target_index)
    .validate_study_type()
    .validate_unique_study_id()
)

Failing any validation step adds a corresponding flag to the qualityControls column.
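To illustrate the chaining pattern, here is a minimal Spark-free sketch. It is not the gentropy implementation: `Study`, `ToyStudyIndex`, `_flag_where`, the rule used for valid study types and the `INVALID_TYPE` flag text are all assumptions made for this example. The point is only that each `validate_*` call returns a new object with flags appended, so the calls compose.

```python
from __future__ import annotations

from collections import Counter
from collections.abc import Callable
from dataclasses import dataclass, replace

# Flag texts: the first is hypothetical, the second appears in this PR's output.
INVALID_TYPE = "Invalid study type"
DUPLICATED_ID = "The identifier of this study is not unique."


@dataclass(frozen=True)
class Study:
    study_id: str
    study_type: str
    quality_controls: tuple[str, ...] = ()


@dataclass(frozen=True)
class ToyStudyIndex:
    """Toy stand-in for StudyIndex (assumption, not the real class)."""

    studies: tuple[Study, ...]

    def _flag_where(self, condition: Callable[[Study], bool], flag: str) -> ToyStudyIndex:
        """Return a new index in which studies matching `condition` carry `flag`."""
        return ToyStudyIndex(
            tuple(
                replace(s, quality_controls=s.quality_controls + (flag,)) if condition(s) else s
                for s in self.studies
            )
        )

    def validate_study_type(self) -> ToyStudyIndex:
        # Assumed rule: valid types are "gwas" or anything ending in "qtl".
        return self._flag_where(
            lambda s: not (s.study_type == "gwas" or s.study_type.endswith("qtl")),
            INVALID_TYPE,
        )

    def validate_unique_study_id(self) -> ToyStudyIndex:
        counts = Counter(s.study_id for s in self.studies)
        return self._flag_where(lambda s: counts[s.study_id] > 1, DUPLICATED_ID)


index = ToyStudyIndex(
    (
        Study("S1", "gwas"),
        Study("S2", "eqtl"),
        Study("S2", "eqtl"),
        Study("S3", "unknown"),
    )
)
validated = index.validate_study_type().validate_unique_study_id()
flags_by_row = [s.quality_controls for s in validated.studies]
```

Because every step returns a new object rather than dropping rows, a failed check never removes a study; it only annotates it.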

Heads up! To allow reuse of the existing flagging instruments from the StudyLocus class, the necessary function (update_quality_flag) has been moved to the Dataset class, so all of our datasets can be QC-ed the same way.
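The idea of a shared flagging helper can be sketched in plain Python as follows. The class names, the signature and the list-based representation are hypothetical; the real helper operates on Spark columns, and whether it de-duplicates flags the way this sketch does is an assumption.

```python
from __future__ import annotations


class QcDataset:
    """Hypothetical base class; stands in for a shared Dataset parent."""

    @staticmethod
    def update_quality_flag(
        qc: list[str] | None, flag_condition: bool, flag_text: str
    ) -> list[str]:
        """Return a copy of the QC list, adding `flag_text` when `flag_condition` holds."""
        current = list(qc or [])  # a missing QC value is treated as an empty list
        if flag_condition and flag_text not in current:
            current.append(flag_text)
        return current


class StudyIndexLike(QcDataset):
    """Inherits the flagging helper unchanged."""


class StudyLocusLike(QcDataset):
    """Inherits the same helper, so both datasets flag rows identically."""


flags = StudyIndexLike.update_quality_flag(None, True, "No valid disease identifier found.")
flags = StudyLocusLike.update_quality_flag(flags, True, "Non-additive model")
unchanged = StudyLocusLike.update_quality_flag(flags, False, "Failed summary statistics quality control")
```

Placing the helper on the parent class means a subclass only has to supply the condition and the flag text.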

🙈 Missing

  1. In this PR the logic is NOT organised into a step. There is no orchestration, only the business logic.
  2. Gentropy has no disease index dataset, so there is no point in ingesting the disease index as it is. In the current implementation, .validate_disease() therefore expects a dataframe with all current and obsolete EFOs. Once we have a disease dataset class, this assumption can be changed.
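As a rough illustration of why the lookup needs both current and obsolete EFOs, the sketch below resolves study trait identifiers to current disease identifiers in plain Python. The identifiers are made up, and `efo_to_current`, `normalise_diseases` and `has_valid_disease` are hypothetical names for this example, not part of gentropy.

```python
# Toy disease index rows: a current id plus the obsolete ids it replaces (made-up values).
disease_index = [
    {"id": "EFO_A", "obsoleteTerms": ["EFO_OLD_1", "EFO_OLD_2"]},
    {"id": "EFO_B", "obsoleteTerms": None},
]

# One entry per known identifier (current or obsolete), pointing at the current id.
efo_to_current = {}
for row in disease_index:
    for efo in [row["id"], *(row["obsoleteTerms"] or [])]:
        efo_to_current[efo] = row["id"]


def normalise_diseases(trait_ids):
    """Map trait ids to current disease ids, dropping ids that are not in the lookup."""
    return sorted({efo_to_current[t] for t in trait_ids if t in efo_to_current})


def has_valid_disease(trait_ids):
    """A study would be flagged when none of its trait ids can be resolved."""
    return bool(normalise_diseases(trait_ids))


resolved = normalise_diseases(["EFO_OLD_1", "EFO_A", "EFO_UNKNOWN"])
```

With this shape, a study annotated with an obsolete term still resolves to a current disease, while a study whose terms are all unknown ends up with no disease and can be flagged.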

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@DSuveges DSuveges changed the title from "Ds 3359 study index validation" to "feature(StudyIndex): validation for study type, disease, target etc" Jul 5, 2024
@DSuveges DSuveges changed the title from "feature(StudyIndex): validation for study type, disease, target etc" to "feat(StudyIndex): validation for study type, disease, target etc" Jul 5, 2024
@DSuveges DSuveges requested a review from d0choa July 5, 2024 13:41
@DSuveges DSuveges linked an issue Jul 5, 2024 that may be closed by this pull request

DSuveges commented Jul 5, 2024

Validation at work:

from pyspark.sql import functions as f

# Study index:
studies = (
    StudyIndex.from_parquet(session, "/Users/dsuveges/project_data/gentropy/study_index", recursiveFileLookup=True)
)

# Gene index:
gene_index = (
    GeneIndex.from_parquet(session, "/Users/dsuveges/project_data/gentropy/gene_index")
)

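# Disease index: one row per known EFO identifier (current or obsolete),
# mapped to the current disease identifier.
disease_map = (
    session.spark.read.parquet('/Users/dsuveges/project_data/gentropy/diseases')
    .select(
        f.col('id').alias('diseaseId'),
        # Explode the current id together with its obsolete terms;
        # yields a single null row when there are no obsolete terms.
        f.explode_outer(
            f.when(
                f.col('obsoleteTerms').isNotNull(),
                f.array_union(
                    f.array('id'),
                    f.col('obsoleteTerms')
                )
            )
        ).alias('efo')
    )
    # Fall back to the current id where the explode produced a null.
    .withColumn(
        'efo',
        f.coalesce(f.col('efo'), f.col('diseaseId'))
    )
)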


validated_studies = (
    studies
    .validate_disease(disease_map)
    .validate_target(gene_index)
    .validate_study_type()
    .validate_unique_study_id()
    .persist()
)

Out of 1,975,874 studies, 14,871 are flagged:

+--------------+-----+
|     projectId|count|
+--------------+-----+
|  Nedelec_2016|   40|
|        OneK1K|   21|
|   Alasoo_2018|   30|
|          GTEx| 1317|
|   FINNGEN_R10| 4816|
|          GCST| 7670|
|   Nathan_2022|   38|
|        FUSION|   69|
|Schmiedel_2018|  119|
|    Cytoimmgen|   39|
|       GENCORD|   25|
|     BLUEPRINT|   82|
|      GEUVADIS|   22|
|    Lepik_2017|   26|
|    Quach_2016|  104|
|  Fairfax_2014|   15|
|        ROSMAP|   56|
|        HipSci|   40|
|      BrainSeq|   53|
|       TwinsUK|   83|
+--------------+-----+

Distribution of quality flags:

+----------------------------------------------------+-----+
|qualityControl                                      |count|
+----------------------------------------------------+-----+
|Failed summary statistics quality control           |472  |
|Target/gene identifier could not match to reference.|2385 |
|No valid disease identifier found.                  |11962|
|The identifier of this study is not unique.         |4838 |
|Non-additive model                                  |32   |
+----------------------------------------------------+-----+

As the labels show, flags already present in the pre-validation study indices are carried over.
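The per-flag counts above add up to 19,689, which is more than the 14,871 flagged studies; that would be consistent with some studies carrying more than one flag. A minimal sketch of the two kinds of count, using made-up rows in plain Python rather than the actual Spark aggregation:

```python
from collections import Counter

# Made-up study rows; only the flag texts are taken from the table above.
studies = [
    {"studyId": "S1", "qualityControls": []},
    {"studyId": "S2", "qualityControls": ["No valid disease identifier found."]},
    {
        "studyId": "S3",
        "qualityControls": [
            "No valid disease identifier found.",
            "The identifier of this study is not unique.",
        ],
    },
]

# Number of studies with at least one flag.
flagged_studies = sum(1 for s in studies if s["qualityControls"])

# Number of occurrences of each flag: one count per (study, flag) pair.
flag_counts = Counter(flag for s in studies for flag in s["qualityControls"])
```

Here two studies are flagged, but the flag counts sum to three because one study carries two flags.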

disease_map (DataFrame): Reference dataframe with diseases

Returns:
DataFrame: where the disease column name will contain the

@d0choa d0choa Jul 11, 2024

Truncated string in the Returns: section. You could also add a slightly more verbose explanation of the method, because the name _normalise_disease doesn't say much about what it is for.

Contributor Author

I'm adding more context.


@d0choa d0choa left a comment

Consistent, nice and tidy. Great!

@DSuveges DSuveges merged commit 91817fd into dev Jul 11, 2024
4 checks passed
@DSuveges DSuveges deleted the ds_3359_study_index_validation branch July 11, 2024 10:29
project-defiant pushed a commit that referenced this pull request Jul 12, 2024
* chore: snapshot

* feat(StudyIndex): adding valiation methods

* feat(studyIdex): adding disease validation

* fix: typo in test

* fix: moving import under the type checking condition

* fix: some columns might need to be dropped

* fix(study index): preventing [null] arrays in the cohorts object

* fix(study index): more context is provided for disease normalisation
Successfully merging this pull request may close these issues.

Study index gentropy ETL step
2 participants