feat(StudyIndex): validation for study type, disease, target etc #672
…_3359_study_index_validation
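Validation at work:

    # Assumes `session` is an existing gentropy session and that
    # StudyIndex and GeneIndex are already imported.
    import pyspark.sql.functions as f

    # Study index:
    studies = (
        StudyIndex.from_parquet(session, "/Users/dsuveges/project_data/gentropy/study_index", recursiveFileLookup=True)
    )

    # Gene index:
    gene_index = (
        GeneIndex.from_parquet(session, "/Users/dsuveges/project_data/gentropy/gene_index")
    )

    # Disease index: map every current and obsolete EFO term to the current disease id.
    disease_map = (
        session.spark.read.parquet('/Users/dsuveges/project_data/gentropy/diseases')
        .select(
            f.col('id').alias('diseaseId'),
            # One row per term: the current id plus each obsolete term, when there are any.
            f.explode_outer(
                f.when(
                    f.col('obsoleteTerms').isNotNull(),
                    f.array_union(
                        f.array('id'),
                        f.col('obsoleteTerms')
                    )
                )
            ).alias('efo')
        )
        # Diseases without obsolete terms map to themselves.
        .withColumn(
            'efo',
            f.coalesce(f.col('efo'), f.col('diseaseId'))
        )
    )

    # Run all validation steps; each one adds quality flags to failing studies.
    validated_studies = (
        studies
        .validate_disease(disease_map)
        .validate_target(gene_index)
        .validate_study_type()
        .validate_unique_study_id()
        .persist()
    )

Out of 1,975,874 studies, 14,871 are flagged: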
Distribution of quality flags:
As can be seen from the labels, the flags are carried over from the pre-validated study indices.
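To make the shape of disease_map easier to picture, here is a minimal plain-Python sketch of the same expansion. It is not the PR's code: the records and the helper name build_disease_map are made up for illustration, and only the field names id and obsoleteTerms come from the snippet above.

```python
# Hypothetical disease records; the identifiers are placeholders, not real EFO terms.
diseases = [
    {"id": "EFO_0000001", "obsoleteTerms": ["EFO_0009999", "EFO_0008888"]},
    {"id": "EFO_0000002", "obsoleteTerms": None},
]


def build_disease_map(records):
    """Return (efo, diseaseId) pairs covering current and obsolete identifiers."""
    pairs = []
    for record in records:
        current = record["id"]
        obsolete = record["obsoleteTerms"]
        if obsolete is None:
            # Mirrors explode_outer + coalesce: no obsolete terms, so the id maps to itself.
            pairs.append((current, current))
            continue
        # Mirrors array_union(array('id'), obsoleteTerms): current id first, duplicates dropped.
        terms = []
        for efo in [current, *obsolete]:
            if efo not in terms:
                terms.append(efo)
        pairs.extend((efo, current) for efo in terms)
    return pairs


disease_pairs = build_disease_map(diseases)
for efo, disease_id in disease_pairs:
    print(efo, "->", disease_id)
```

With a lookup of this shape, a study annotated with an obsolete term can still be resolved to the current disease identifier.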
src/gentropy/dataset/study_index.py
Outdated
    disease_map (DataFrame): Reference dataframe with diseases

    Returns:
        DataFrame: where the disease column name will contain the
The string in the Returns: section is truncated. You could also add a more verbose explanation of the method, because the name _normalise_disease doesn't say much about what it is for.
I'm adding more context.
Consistent, nice and tidy. Great!
* chore: snapshot
* feat(StudyIndex): adding valiation methods
* feat(studyIdex): adding disease validation
* fix: typo in test
* fix: moving import under the type checking condition
* fix: some columns might need to be dropped
* fix(study index): preventing [null] arrays in the cohorts object
* fix(study index): more context is provided for disease normalisation
✨ Context
Logic to validate study index.
🛠 What does this PR implement
The following validation steps are implemented:
- StudyIndex.validate_disease - validating diseases against the provided disease index (a view on the disease index).
- StudyIndex.validate_study_type - flagging studies that are not gwas or some kind of qtl.
- StudyIndex.validate_target - flagging qtl studies that don't have a valid Ensembl gene id, checked against the target index.
- StudyIndex.validate_unique_study_id - flagging studies with non-unique study identifiers.
How it works:
Failing any validation step adds a corresponding flag to the qualityControls column.
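The flagging behaviour can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation in this PR: the flag labels, the example studies, the field names other than qualityControls, and the rule that a qtl study type ends in "qtl" are all hypothetical. Disease validation is left out for brevity.

```python
from collections import Counter


def add_flag(study, failed, flag):
    """Append `flag` to the study's qualityControls list when the check failed."""
    if failed and flag not in study["qualityControls"]:
        study["qualityControls"].append(flag)
    return study


def validate(studies, valid_gene_ids):
    """Run three of the checks; studies are flagged, never dropped."""
    id_counts = Counter(study["studyId"] for study in studies)
    for study in studies:
        study_type = study["studyType"]
        is_qtl = study_type.endswith("qtl")  # assumed naming rule for qtl types
        add_flag(study, not (study_type == "gwas" or is_qtl), "unknown study type")
        add_flag(study, is_qtl and study["geneId"] not in valid_gene_ids, "unresolved target")
        add_flag(study, id_counts[study["studyId"]] > 1, "duplicated study id")
    return studies


# Hypothetical studies; the third one already carries a flag from an earlier step.
studies = [
    {"studyId": "S1", "studyType": "gwas", "geneId": None, "qualityControls": []},
    {"studyId": "S2", "studyType": "eqtl", "geneId": "ENSG_UNKNOWN", "qualityControls": []},
    {"studyId": "S3", "studyType": "survey", "geneId": None, "qualityControls": ["earlier flag"]},
    {"studyId": "S1", "studyType": "gwas", "geneId": None, "qualityControls": []},
]
validated = validate(studies, valid_gene_ids={"ENSG_KNOWN"})
for study in validated:
    print(study["studyId"], study["qualityControls"])
```

Because every check only appends to the existing list, flags set before validation survive and a study can accumulate several flags.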
Heads up! - to enable the use of the existing flagging instruments from the StudyLocus class, the necessary function (update_quality_flag) is moved to the Dataset class, so all of our datasets are QC-able the same way.
🙈 Missing
- The .validate_disease() method expects a dataframe with all the current and obsolete EFOs. Once we have a disease dataset class, this assumption can be changed.
🚦 Before submitting
- … dev branch?
- … (make test)?
- … (poetry run pre-commit run --all-files)?