Casey Ta edited this page Aug 24, 2024 · 13 revisions

Homepage Link: https://cohd.io/about.html

Columbia Open Health Data (COHD)

Columbia Open Health Data (COHD) provides access to counts and patient prevalence (i.e., prevalence from electronic health records) of conditions, procedures, drug exposures, and patient demographics, as well as the co-occurrence frequencies between them. Count and frequency data were derived from the Columbia University Irving Medical Center's OHDSI database, which includes inpatient and outpatient data. Counts are the number of patients with the concept, e.g., patients who were diagnosed with a condition, were exposed to a drug, or underwent a procedure. Frequencies are the number of patients with the concept divided by the total number of patients in the dataset. Clinical concepts (e.g., conditions, procedures, drugs) are coded by their standard concept ID in the OMOP Common Data Model. To protect patient privacy, all concepts and pairs of concepts with a count ≤ 10 were excluded, and counts were randomized using the Poisson distribution.
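
The count, frequency, and privacy rules described above can be sketched in a few lines of Python. This is a minimal illustration, not COHD's actual pipeline: the function names, the input dictionary, the order of the two privacy steps (suppress first, then randomize), and the choice to compute the frequency from the randomized count are all assumptions made for the example.

```python
import math
import random

MIN_COUNT = 10  # concepts with a count <= 10 are excluded, per the text above


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's method.

    The rate is split into chunks of at most 500 so that exp(-chunk)
    does not underflow; a sum of independent Poisson draws is Poisson.
    """
    total = 0
    remaining = float(lam)
    while remaining > 0:
        chunk = min(remaining, 500.0)
        threshold = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > threshold:
            k += 1
            p *= rng.random()
        total += k
        remaining -= chunk
    return total


def release_counts(true_counts, total_patients, seed=None):
    """Hypothetical helper: suppress small counts, then randomize the rest."""
    rng = random.Random(seed)
    released = {}
    for concept_id, count in true_counts.items():
        if count <= MIN_COUNT:
            continue  # suppressed to protect patient privacy
        noisy = poisson_draw(count, rng)  # assumed: Poisson with mean = true count
        released[concept_id] = {
            "count": noisy,
            "frequency": noisy / total_patients,
        }
    return released


# Made-up patient counts keyed by made-up concept IDs.
example = release_counts({101: 5, 102: 10, 103: 1000}, total_patients=10_000, seed=0)
print(example)
```

Only concept 103 survives suppression in this example, and its released count varies around 1000 from seed to seed.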

Data sources

Outpatient and inpatient EHR data extracted from Columbia University Irving Medical Center's clinical data warehouse

Key metrics

COHD provides the following association metrics and their measures of statistical significance, captured in biolink:StudyResult structures:

  • Raw counts of each concept and concept pair co-occurrence - biolink:ConceptCountAnalysisResult
  • Chi-squared analysis (Bonferroni adjusted p-value) - biolink:ChiSquaredAnalysisResult
  • Relative frequency (99% confidence interval) - biolink:RelativeFrequencyAnalysisResult
  • Observed-expected frequency ratio (99% confidence interval) - biolink:ObservedExpectedFrequencyAnalysisResult
    Example values:

    Strength of association   Condition         Drug                                   ln ratio   99% CI
    Strong positive           type 2 diabetes   metformin                               2.570     2.554, 2.586
    Weak or no association    Sprain of knee    pneumococcal polysaccharide vaccine     0.078     -0.094, 0.231
    Negative                  birth             isotretinoin                           -1.337     -2.590, -0.644
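
To make the metrics above more concrete, here is a hedged Python sketch of how they are commonly computed. The page names the metrics but does not give their formulas, so the definitions below are assumptions: relative frequency as the pair count divided by the count of one base concept, the expected pair count under independence as count_a × count_b / N, and the Bonferroni adjustment as the raw p-value multiplied by the number of tests (capped at 1). All function names and numbers are hypothetical.

```python
import math


def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the other concept."""
    return pair_count / base_count


def ln_observed_expected_ratio(pair_count, count_a, count_b, total_patients):
    """Natural log of observed / expected co-occurrence count.

    The expected count assumes the two concepts occur independently.
    """
    expected = count_a * count_b / total_patients
    return math.log(pair_count / expected)


def bonferroni_adjust(p_value, num_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * num_tests)


# Made-up counts: 200 patients have both concepts out of 100,000 in total.
ratio = ln_observed_expected_ratio(200, 1_000, 2_000, 100_000)
print(round(ratio, 3))
```

Under these assumptions, an ln ratio near 0 means the pair co-occurs about as often as independence predicts, a positive value means more often, and a negative value means less often, which matches the way the example values above are labeled.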

Data sets

COHD contains the following data sets:

  1. 5-year non-hierarchical dataset: Includes clinical data from 2013-2017.
  2. Lifetime non-hierarchical dataset: Includes clinical data from all dates.
  3. 5-year hierarchical dataset: Counts for each concept include patients from descendant concepts. Includes clinical data from 2013-2017.
  4. Temporal beta: Quantifies temporal relations between all concept pairs. Includes clinical data from all dates.

While the lifetime dataset captures a larger patient population and a wider range of concepts, the 5-year dataset has better underlying data consistency. In the 5-year hierarchical dataset, the count for each concept includes the patients from all descendant concepts. For example, the count for ibuprofen (ID 1177480) includes patients with Ibuprofen 600 MG Oral Tablet (ID 19019073), Ibuprofen 400 MG Oral Tablet (ID 19019072), Ibuprofen 20 MG/ML Oral Suspension (ID 19019050), etc. The COHD KP automatically chooses the most appropriate COHD dataset depending on the concepts being queried.
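
The hierarchical roll-up described above can be illustrated with a small Python sketch. It is not COHD's implementation: the `children` mapping, the patient sets, and the function name are hypothetical, and the real OMOP hierarchy has intermediate levels between an ingredient and its clinical drugs that are flattened here. The sketch counts distinct patients, so a patient recorded under several descendants is counted once.

```python
def hierarchical_count(concept_id, children, patients_by_concept):
    """Count distinct patients with the concept or any of its descendants."""
    seen_concepts = set()
    patients = set()
    stack = [concept_id]
    while stack:
        current = stack.pop()
        if current in seen_concepts:
            continue  # guard against concepts reachable by more than one path
        seen_concepts.add(current)
        patients |= patients_by_concept.get(current, set())
        stack.extend(children.get(current, ()))
    return len(patients)


# Concept IDs are taken from the ibuprofen example; patient IDs are made up.
children = {1177480: [19019073, 19019072, 19019050]}
patients_by_concept = {
    1177480: {"p1"},
    19019073: {"p1", "p2"},
    19019072: {"p3"},
    19019050: {"p2", "p4"},
}

print(hierarchical_count(1177480, children, patients_by_concept))
```

Taking the union of patient sets, rather than summing per-concept counts, avoids double-counting patients who appear under more than one descendant concept.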

More details about the COHD dataset can be found in the Clinical Data Provider prototype Kick-off presentation PDF.

COHD for COVID-19 Research (COHD-COVID)

Columbia Open Health Data for COVID-19 Research (COHD-COVID) is similar to COHD but adjusts the analysis and cohorts to facilitate COVID-19 research. COHD-COVID provides access to counts and visit prevalence (i.e., prevalence from electronic health records) of conditions, procedures, and drug exposures, as well as the co-occurrence frequencies between them. Count and frequency data were derived from inpatient data in the Columbia University Irving Medical Center's OHDSI database. Counts are the number of visits with the concept, e.g., visits in which a condition was diagnosed, a drug was given, or a procedure was performed. Frequencies are the number of visits with the concept divided by the total number of visits in the dataset. Clinical concepts (e.g., conditions, procedures, drugs) are coded by their standard concept ID in the OMOP Common Data Model. To protect patient privacy, all concepts and pairs of concepts with a count ≤ 10 were excluded, and counts were randomized using the Poisson distribution.

Data sources

Inpatient EHR data extracted from Columbia University Irving Medical Center's clinical data warehouse

Key metrics

The same association metrics described above are provided.

Data sets

Datasets from three primary cohorts are available:

  1. COVID-19: Hospitalized patients aged 18 or older with a COVID-19 related condition diagnosis and/or a confirmed positive COVID-19 test during their hospitalization period or within the prior 21 days. Date range: March 1, 2020 to September 1, 2020. This cohort is further stratified by sex (male and female) and age (adult: 18-64, senior: 65+).
  2. General inpatient: All hospitalized patients aged 18 or older. Date range: January 1, 2014 to December 31, 2019.
  3. Influenza: Hospitalized patients aged 18 or older with at least one influenza condition, pre-coordinated positive measurement, or positive influenza test during their hospitalization period or within the prior 21 days. Date range: January 1, 2014 to December 31, 2019.
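
The cohort rules above share two pieces of logic: a qualifying event must fall during the hospitalization or within the prior 21 days, and the COVID-19 cohort is split into adult and senior age strata. A minimal Python sketch of both checks follows. The function names are hypothetical, and treating the 21-day boundary and the discharge date as inclusive is an assumption, since the page does not say how boundaries are handled.

```python
from datetime import date, timedelta

LOOKBACK = timedelta(days=21)  # "within the prior 21 days", per the cohort definitions


def event_qualifies(event_date, admit_date, discharge_date):
    """True if the event falls in the hospitalization or the 21 days before it."""
    return admit_date - LOOKBACK <= event_date <= discharge_date


def age_stratum(age):
    """Map an age in years to the strata named above; None if under 18."""
    if age < 18:
        return None
    return "adult" if age <= 64 else "senior"


# Made-up visit: admitted March 20, 2020 and discharged March 30, 2020.
admit, discharge = date(2020, 3, 20), date(2020, 3, 30)
print(event_qualifies(date(2020, 3, 1), admit, discharge), age_stratum(70))
```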

Technical User Guide

Mode of Access

Use Cases

Knowledge Sources Accessed

Source Code

Additional information

References

  • Ta CN, Dumontier M, Hripcsak G, Tatonetti NP, Weng C. Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Scientific Data. 5:180273; 2018. doi:10.1038/sdata.2018.273
  • Lee J, Kim JH, Liu C, Hripcsak G, Natarajan K, Ta CN, Weng C. Columbia Open Health Data for COVID-19 Research: Database Analysis. Journal of Medical Internet Research. 23(9):e31122; 2021. doi:10.2196/31122

External Documentation

Contact
