GitHub - vlooten/BioQuality: What can millions of laboratory values tell us about the temporal aspect of data quality? study of data spanning 17 years in a clinical data warehouse. Comput Methods Programs Biomed. 2018 Dec 29. 10.1016/j.cmpb.2018.12.030

vlooten / BioQuality Public

forked from equipe22/BioQuality

Notifications You must be signed in to change notification settings
Fork 0
Star 0

What can millions of laboratory values tell us about the temporal aspect of data quality? study of data spanning 17 years in a clinical data warehouse. Comput Methods Programs Biomed. 2018 Dec 29. 10.1016/j.cmpb.2018.12.030

hal.archives-ouvertes.fr/hal-01978796/document

GPL-3.0 license

0 stars 1 fork Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
Simulations_review		Simulations_review
BIO.20170821101046.csv		BIO.20170821101046.csv
Count.csv		Count.csv
LICENSE		LICENSE
ReadMe.Rmd		ReadMe.Rmd
breakpoints.R		breakpoints.R
datareal.R		datareal.R
datasimu.R		datasimu.R
discretization.R		discretization.R
fun_breakpoints.R		fun_breakpoints.R
fun_datasimu.R		fun_datasimu.R
fun_discretization.R		fun_discretization.R
fun_missingdata.R		fun_missingdata.R
fun_movingquantiles.R		fun_movingquantiles.R
fun_trends.R		fun_trends.R
init.R		init.R
loadresults.R		loadresults.R
main.R		main.R
missingdata.R		missingdata.R
model_exams.Rmd		model_exams.Rmd
movingquantiles.R		movingquantiles.R
sensibility_analyses.rds		sensibility_analyses.rds
trends.R		trends.R

Repository files navigation

The objective of these scripts is to provide diagnostic tools of the temporal quality of biological data in a clinical data warehouse.
These scripts are based on:
"WHAT CAN MILLIONS OF LABORATORY VALUES TELL US ABOUT THE TEMPORAL ASPECT OF DATA QUALITY? STUDY OF DATA SPANNING 17 YEARS IN A CLINICAL DATA WAREHOUSE".
Comput Methods Programs Biomed. 2018 Dec 29. pii: S0169-2607(18)30708-9.
doi: 10.1016/j.cmpb.2018.12.030.
https://www.ncbi.nlm.nih.gov/pubmed/30612785

Abstract
OBJECTIVE:
To identify common temporal evolution profiles in biological data and propose a semi-automated method to these patterns in a clinical data warehouse (CDW).
MATERIALS AND METHODS:
We leveraged the CDW of the European Hospital Georges Pompidou and tracked the evolution of 192 biological parameters over a period of 17 years (for 445,000 + patients, and 131 million laboratory test results)
RESULTS:
We identified three common profiles of evolution: discretization, breakpoints, and trends. We developed computational and statistical methods to identify these profiles in the CDW. Overall, of the 192 observed biological parameters (87,814,136 values), 135 presented at least one evolution. We identified breakpoints in 30 distinct parameters, discretizations in 32, and trends in 79.
DISCUSSION AND CONCLUSION:
our method allowed the identification of several temporal events in the data. Considering the distribution over time of these events, we identified probable causes for the observed profiles: instruments or software upgrades and changes in computation formulas. We evaluated the potential impact for data reuse. Finally, we formulated recommendations to enable safe use and sharing of biological data collection to limit the impact of data evolution in retrospective and federated studies (e.g. the annotation of laboratory parameters presenting breakpoints or trends).

# The data input format

Two examples files have to be put in a directory called 'Example' in the working directory:
+ Count.csv
+ BIO.20170821101046.csv

These files helps you to adapt your SQL query for the proposed data profiling algorithm.
The file 'Count.csv' includes the example of the list of exams as an input.
The file 'BIO.20170821101046.csv' includes the require input format for lab tests data.

# How to use scripts without real data ?

To run for the first time the data profiling algorithm, we recommend to try it on simulated data:
+ Put all the R files in a R project directory.
+ Launch the master script called 'main.R'

Report and derived files will be create with the simulated data. The simulated data are generated with the script 'datasimu.R'. This script generates the differents patterns described in the article.

# How to use scripts with real data ?

After lauching programs with simulated data. You will show the data directory where to put your real data. You have also to generate Count.csv which contains the index of lab tests.
If you have an I2B2 data warehouse you can also adapt the datareal.R file.

# Landscape of R files

There are 13 R files :

+ init.R: install and load packages, create directories, source fun_ R files
+ main.R: master script call all script
+ datasimu.R: simulation of data to test the programs. This script generates the differents patterns described in the article.
+ datareal.R: extract files from i2b2 clinical data warehouse in the right format
+ movingquantiles.R: compute the moving quantiles
+ fun_movingquantiles.R: auxiliary file associated to the movingquantiles.R file
+ missingdata.R: detect missing data (not described in the article, previous version)
+ fun_missingdata.R: auxiliary file associated to the missingdata.R file (not described in the article, previous version)
+ discretization.R: detect discritization
+ fun_discretization.R: auxiliary file associated to the discretization.R file
+ breakpoints.R: detect breakpoints
+ fun_breakpoints.R: auxiliary file associated to the breakpoints.R file
+ trends.R: performe regressions
+ fun_trends.R: auxiliary file associated to the trends.R file