Data pipeline for covidestim, + analysis of revisioning in public COVID data

covidestim/covidestim-sources

covidestim-sources

This repository provides a way to clean various input data used for the covidestim model, producing the following easy-to-use outcomes:

  • Cases
  • Deaths
  • Vaccination-related risk ratio
  • Vaccinations and boosters administered
  • Hospitalizations

These data are offered at the following geographies:

Outcome County-level State-level
Cases
Deaths*
Risk-ratio
Vax-boost
Hospitalizations

* Note that due to the discontinuation of JHU data, deaths data are marked NA or 0 beyond February 14, 2023.

Hospitalization data were reported on a Friday-Thursday week until June 19, 2023, and on a Sunday-Saturday week thereafter. The weekly CDC case data are reported on a Thursday-Wednesday week. We match each hospitalization week to the closest case report week. For example, the hospitalization reports from Sunday, February 5 to Saturday, February 11 are matched to the case reports from Thursday, February 2 to Wednesday, February 8 (2023 reference dates).
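The week matching described above can be illustrated with a short Python sketch. It is not the repository's R implementation; it assumes that "closest" is measured between week start dates, and the function name `closest_case_week_start` is hypothetical.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday() counts Monday as 0, so Thursday is 3


def closest_case_week_start(hosp_week_start: date) -> date:
    """Return the Thursday (case-week start) nearest to a hospitalization week start.

    Assumption: distance is measured between the two week start dates.
    """
    days_since_thursday = (hosp_week_start.weekday() - THURSDAY) % 7
    previous = hosp_week_start - timedelta(days=days_since_thursday)
    following = previous + timedelta(days=7)
    # A 7-day gap cannot be split evenly, so there is never a tie.
    if hosp_week_start - previous <= following - hosp_week_start:
        return previous
    return following


# Sunday-Saturday scheme: week starting Sunday, February 5, 2023
print(closest_case_week_start(date(2023, 2, 5)))
# Friday-Thursday scheme: week starting Friday, February 3, 2023
print(closest_case_week_start(date(2023, 2, 3)))
```

Under this assumption, both reporting schemes map onto the case week starting Thursday, February 2, 2023, which agrees with the example in the text.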

Usage and dependencies

This repository is essentially a series of GNU Make targets (see makefile), which depend on the results of HTTP requests as well as on data sources in data-sources/. None of the "cleaned" data are committed to the repository; you need to run make yourself to produce them.

First, install Git LFS.

Then, clone the repository and initialize the Git submodules, which track some of our external data sources:

git clone https://github.com/covidestim/covidestim-sources && cd covidestim-sources
git submodule init
git submodule update --remote # This will take 5-30 minutes

Then, make sure you have the necessary R packages installed. These are:

  • tidyverse
  • cli
  • docopt
  • sf
  • vaccineAdjust: Not on CRAN. To install: devtools::install_github("covidestim/vaccineAdjust")
  • raster
  • spdep

You can install them in the R console: install.packages(c('tidyverse', 'cli', 'docopt', 'sf', 'raster', 'spdep')).

Finally, try to make the most important targets. Note that you will need GNU Make >= 4.3 installed, which does not ship with OS X.

# Make all primary outcomes
make -Bj data-products/{case-death-rr-boost-hosp.csv,case-death-rr-boost-hosp-state.csv}
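Before running the command above, it may help to confirm that the installed Make meets the stated >= 4.3 requirement. The following Python sketch is an illustration only and is not part of the repository; the helper names are hypothetical, and it assumes that `make --version` prints a first line of the form `GNU Make X.Y`.

```python
import re
import shutil
import subprocess

REQUIRED = (4, 3)  # minimum GNU Make version stated in this README


def parse_make_version(version_output):
    """Return (major, minor) from `make --version` output, or None if not GNU Make."""
    match = re.search(r"GNU Make (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def make_is_new_enough(version_output, required=REQUIRED):
    """True if the output identifies a GNU Make at or above the required version."""
    version = parse_make_version(version_output)
    return version is not None and version >= required


if __name__ == "__main__":
    make_path = shutil.which("make")
    if make_path is None:
        print("make was not found on PATH")
    else:
        result = subprocess.run(
            [make_path, "--version"], capture_output=True, text=True, check=False
        )
        if make_is_new_enough(result.stdout):
            print("GNU Make is new enough")
        else:
            print("GNU Make >= 4.3 is required")
```

On some systems a newer GNU Make is installed under a different name, such as `gmake`; in that case, check that executable instead.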

Repository structure

  • makefile: The project makefile. If you're confused about how a piece of data gets cleaned, go here first. If you've never read a Makefile before, it's advisable to read an introduction to Make, like this one.
  • data-products/: All cleaned data is written to this directory. Some recipes will also produce metadata, which will always have a .json extension.
  • data-sources/: All git submodules are stored here, as well as static files used in recipes, like population sizes, polygons, and records of periods of nonreporting.
  • example-output/: Some example cleaned data, for reference
  • R/: All data cleaning scripts live here

Keeping your data sources up-to-date

Data sources will not automatically update, and thus, make will not normally do anything if you attempt to remake a target! This is undesirable if you believe there may be newer versions of data sources available. To pull new data from sources backed by submodules, run:

git submodule update --remote

Then use make -B to force targets to be remade, like so:

make -B data-products/case-death-rr.csv

Data sources

All data sources for the cleaned data are either:

  • Committed to the repository in data-sources/
  • Committed to the repository in data-sources/, but backed by Git LFS
  • Accessed through HTTP requests in the makefile or from within R scripts
  • Committed to the repository as Git submodules in data-sources/

| Data | Used for | Accessed through | Frequency of update |
|---|---|---|---|
| Johns Hopkins CSSE | Cases, Deaths | Submodule data-sources/jhu-data | >Daily, until February 14, 2023 |
| Covid Tracking Project | Cases, Deaths (2021-02 - 2021-06) | HTTP, api.covidtracking.com | No longer updated |
| NYTimes covid data | Nothing, but a future merge will use it to supplement counties missing from the JHU dataset | Submodule, data-sources/nyt-data | >Daily |
| USCB county, state population estimates | Everything | data-sources/{fips,state}pop.csv, reformatted from .xls | 1/yr? |
| DHHS facility-level hospitalizations data | Hospitalizations | HTTP, healthdata.gov/api | 1/wk |
| Dartmouth Atlas Zip-HSA-HRR crosswalk | Hospitalizations (agg/disagg) | data-sources/ZipHsaHrr18.csv, downloaded from dartmouthatlas.org | 1/yr? |
| Dartmouth Atlas HSA polygons | Hospitalizations (agg/disagg) | data-sources/hsa-polygons/ (Git LFS), downloaded from dartmouthatlas.org | 1/yr? |
| Census Block Group polygons | Hospitalizations (agg/disagg) | data-sources/cbg-polygons/ (Git LFS), downloaded from TIGER | 1/yr? |
| Census Block Group popsize | Hospitalizations (agg/disagg) | data-sources/population_by_cbg.csv/, extracted from TIGER | 1/yr? |
| Vaccination adjustments | Vaccination adjustments | data-sources/vaccines-counties.csv | Static w/ data through Dec 2022 |
| Data.CDC.gov | Booster and vaccination data | data-sources/cdc-vax-boost-state.csv/, extracted from cdc-states and data-sources/cdc-vax-boost-county.csv extracted from cdc-counties | Daily |
| Data.CDC.gov | Case data, after February 14, 2023 | cdc-state-cases and cdc-counties-cases | Weekly |

Some data sources are tracked as git submodules, and other sources, being accessed through APIs, are fetched over HTTP. Large static files are stored through Git LFS.

The included makefile provides a common interface for directing the fetching and cleaning of all data sources.

Data sources

  • data-sources/vaccines-counties: A cached file containing daily RR adjustments to the transition probabilities for each state and county, up until December 31, 2022. This file is rendered by running vaccineAdjust::run(), with data up until December 1, 2021, and freezing the final RR adjustment. After December 1, 2021, vaccination adjustments occur within the covidestim model.

Targets

  • make data-products/case-death-rr-boost-hosp.csv
    Reads cleaned archived JHU county-level case/death data. Joins by location and date with static vaccine IFR adjustments from data-sources/vaccines-counties and the cleaned CDC vax-boost-county.csv. Weekly aggregates are joined with the cleaned hhs-hospitalizations-by-county.csv. Any metadata for counties included in the cleaned data will be stored in case-death-rr-boost-hosp-metadata.json. Joins the weekly CDC case data from data-sources/cdc-cases-raw.csv.

  • make data-products/case-death-rr-boost-hosp-state.csv
    Reads cleaned archived JHU state-level data, splicing in archived Covid Tracking Project data. For details on this, see the makefile. Joins by location and date with static vaccine IFR adjustments from data-sources/vaccines-counties and the cleaned CDC vax-boost-state.csv. Weekly aggregates are joined with the cleaned hhs-hospitalizations-by-state.csv. Any metadata for counties included in the cleaned data will be stored in case-death-rr-boost-hosp-state-metadata.json. Joins the weekly CDC case data from data-sources/cdc-cases-state-raw.csv.

  • make data-products/nyt-counties.csv
    Clean NYT county-level case/death data. Writes nyt-counties-rejects.csv.

  • make data-products/hhs-hospitalizations-by-state.csv
    State-level aggregation of county-level hospitalization data.

  • make data-products/hhs-hospitalizations-by-county.csv
    County-level aggregation of facility-level hospitalization data. See the next section for details on how this is done.

  • make data-products/hhs-hospitalizations-by-facility.csv
    Cleans facility-level data from DHHS's API and annotates each facility with an HSA id. Also computes .min and .max columns to compensate for the censoring applied when there are 1-3 hospitalizations in a given week.

Hospitalizations data pipeline

Key document: COVID-19 Guidance for Hospital Reporting and FAQs For Hospitals, Hospital Laboratory, and Acute Care Facility Data Reporting, January 6, 2022 revision

We source hospitalizations data from the official HHS facility-level dataset. In order to be useful to our model, we transform these data into a county-level dataset.

Outcome format

Outcomes are presented across 3-4 variables. For an outcome name, the following variables may be present:

  • {name}: The outcome itself, including censored data. This means that if a facility reports -999999 for that week, the {name} outcome will be equal to -999999.

  • {name}_min: The smallest the outcome could be, with all censored values (each of which represents a possible range of 1-3) resolved to 1.

  • {name}_max: The largest the outcome could be if all censored values are resolved to 3.

  • {name}_max2: The largest the outcome could be if all censored values are resolved to 3 and any missing days are imputed using the average of the present days.

    Note: This quantity is not meaningful for the following averaged prevalence outcomes, because they are already averaged across the number of days reported by the facility that week (which is not necessarily 7). For these outcomes, {name}_max == {name}_max2.

    • averageAdultICUPatientsConfirmed
    • averageAdultICUPatientsConfirmedSuspected
    • averageAdultInpatientsConfirmed
    • averageAdultInpatientsConfirmedSuspected
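The bounds described above can be illustrated for a summed (non-averaged) outcome with a small Python sketch. It is not the repository's R code: the function name and the `days_reported` argument are hypothetical, and the sketch assumes each facility-week supplies a weekly total plus the number of days it reported.

```python
CENSORED = -999999  # sentinel value reported in place of a count of 1-3


def resolve_bounds(weekly_total, days_reported):
    """Return min, max, and max2 for one facility-week of a summed outcome."""
    if not 1 <= days_reported <= 7:
        raise ValueError("days_reported must be between 1 and 7")
    if weekly_total == CENSORED:
        low, high = 1, 3  # a censored value stands for a count between 1 and 3
    else:
        low = high = weekly_total
    # max2: fill missing days with the average of the reported days
    high_imputed = high * 7 / days_reported
    return {"min": low, "max": high, "max2": high_imputed}


# Three hypothetical facility-weeks: (weekly total, days reported)
facility_weeks = [(12, 7), (CENSORED, 7), (10, 5)]
bounds = [resolve_bounds(total, days) for total, days in facility_weeks]

print(sum(b["min"] for b in bounds))   # censored value counted as 1
print(sum(b["max"] for b in bounds))   # censored value counted as 3
print(sum(b["max2"] for b in bounds))  # also scales the 5-day facility up to 7 days
```

For the averaged prevalence outcomes listed above, the imputation step would be skipped, so max2 equals max.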

Outcomes table

Variable Meaning min/max max2
fips FIPS code of the county
weekstart YYYY-MM-DD of the first date in the week
admissionsAdultsConfirmed # admissions of adults with confirmed [1] Covid
admissionsAdultsSuspected # admissions of adults with suspected [2] Covid
admissionsPedsConfirmed # admissions of peds with confirmed [1] Covid
admissionsPedsSuspected # admissions of peds with suspected [2] Covid
averageAdultICUPatientsConfirmed Average number of ICU beds occupied by adults with confirmed [1] Covid that week equal to max
averageAdultICUPatientsConfirmedSuspected Average number of ICU beds occupied by adults with confirmed [1] or suspected [2] Covid that week equal to max
averageAdultInpatientsConfirmed Average number of inpatient beds occupied by adults with confirmed [1] Covid that week equal to max
averageAdultInpatientsConfirmedSuspected Average number of inpatient beds occupied by adults with confirmed [1] or suspected [2] Covid that week equal to max
covidRelatedEDVisits Total number of ED visits that week related to Covid [3]

Geographic aggregation/disaggregation

[Figure: Map of county, HSA, and CBG boundaries]

Simply identifying which county each facility lies within and then summing across all facilities in a county carries the following drawbacks:

  • It ignores the fact that the hospital may be treating patients from other counties.

  • It ignores the fact that patients from one county may be treated at a hospital in an adjacent (or even non-adjacent) county.

These two issues will cause particularly large problems (biases) when:

  • There are major medical centers in the area, which are more likely to take the lion's share of severe patients during times of peak Covid prevalence.

  • There are small or sparsely-populated counties, where residents may have to travel outside of their county to seek hospital care.

To help solve this problem, we rely on a dataset maintained by the Dartmouth Atlas which defines geographic units called Hospital Service Areas (HSA). These service areas are meant to represent a notion of a catchment area for each hospital:

Hospital service areas (HSAs) are local health care markets for hospital care. An HSA is a collection of ZIP codes whose residents receive most of their hospitalizations from the hospitals in that area. HSAs were defined by assigning ZIP codes to the hospital area where the greatest proportion of their Medicare residents were hospitalized. Minor adjustments were made to ensure geographic contiguity. Most hospital service areas contain only one hospital. The process resulted in 3,436 HSAs.

Importantly, HSAs are only an approximation of a catchment area. A patient may well travel outside of the "catchment area" for care with some nonzero probability, so the concept of a catchment area as "patient always goes to a hospital in this polygon" has inherent limitations in capturing patient facility choice. See [this 2015 paper][paper] by Kilaru and Carr for a discussion of these problems.

Methodology

[Figure: Diagram of the HSA => FIPS process]

Nonetheless, we use HSAs as our aggregate geographical unit because we believe they are a better representation of a catchment area than what we would get by simply using the county border enclosing each facility. To leverage these HSAs to create county-level hospitalizations data, we:

  1. Aggregate facility-level data to the HSA level
  2. Fracture the HSAs using county boundaries
  3. Use CBG population data to divide the outcomes from fractured HSAs into the intersecting counties in a population-proportional manner. Note that the CBG population is not present for D.C. in our dataset, so we manually add the D.C. population for the corresponding HSA.

[paper]: https://pubmed.ncbi.nlm.nih.gov/25961661/
[da]: https://www.dartmouthatlas.org/faq/
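The population-proportional split in the last step can be sketched in Python as follows. This is an illustration under stated assumptions, not the repository's R implementation: it assumes the HSA totals and the CBG population inside each HSA/county fragment have already been computed, and every identifier and number below is made up.

```python
from collections import defaultdict


def disaggregate_to_counties(hsa_totals, fragments):
    """Split HSA-level totals across counties in proportion to fragment population.

    hsa_totals: dict mapping HSA id -> outcome total for that HSA.
    fragments: list of (hsa_id, county_id, population) tuples, one per
        HSA/county intersection, where population is the summed CBG
        population falling inside that intersection.
    """
    hsa_population = defaultdict(float)
    for hsa_id, _county_id, population in fragments:
        hsa_population[hsa_id] += population

    county_totals = defaultdict(float)
    for hsa_id, county_id, population in fragments:
        if hsa_population[hsa_id] == 0:
            continue  # no population information; nothing to allocate
        share = population / hsa_population[hsa_id]
        county_totals[county_id] += hsa_totals.get(hsa_id, 0.0) * share
    return dict(county_totals)


# Hypothetical example: HSA "A" straddles two counties, HSA "B" sits inside one.
hsa_totals = {"A": 100, "B": 40}
fragments = [
    ("A", "c1", 3000),
    ("A", "c2", 1000),
    ("B", "c2", 500),
]
print(disaggregate_to_counties(hsa_totals, fragments))
```

Here county "c1" receives three quarters of HSA "A", while county "c2" receives the remaining quarter plus all of HSA "B".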

Footnotes

  1. Definition of "Laboratory-confirmed Covid":
    [Figure: Definition of "Laboratory-confirmed Covid"] Source: Page 44, HHS hospital reporting guidance

  2. Definition of "suspected Covid":
    "'Suspected' is defined as a person who is being managed as though he/she has COVID-19 because of signs and symptoms suggestive of COVID-19 but does not have a laboratory-positive COVID-19 test result."
    Source: Page 14, HHS hospital reporting guidance

  3. Definition of "related to Covid":
    "Enter the total number of ED visits who were seen on the previous calendar day who had a visit related to suspected or laboratory-confirmed COVID-19. Do not count patients who receive a COVID-19 test solely for screening purposes in the absence of COVID-19 symptoms."
    Source: Page 14, HHS hospital reporting guidance
