This repository provides a way to clean various input data used for the covidestim model, producing the following easy-to-use outcomes:
- Cases
- Deaths
- Vaccination-related risk ratio
- Vaccinations and boosters administered
- Hospitalizations
These data are offered at the following geographies:
Outcome | County-level | State-level |
---|---|---|
Cases | ✓ | ✓ |
**Deaths\*** | ✓ | ✓ |
Risk-ratio | ✓ | ✓ |
Vax-boost | ✓ | ✓ |
Hospitalizations | ✓ | ✓ |
Hospitalization data were reported on a Friday-Thursday schedule until June 19, 2023, after which reporting switched to a Sunday-Saturday schedule. The weekly CDC case data are reported on a Thursday-Wednesday schedule. We match each hospitalization week to the closest case report week. For example, the hospitalization reports from Sunday, February 5 to Saturday, February 11 are matched to the case reports from Thursday, February 2 to Wednesday, February 8 (2023 reference dates).
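
As a minimal sketch of this week matching, assuming "closest" means the smallest absolute gap between week start dates (the repository's scripts may define it differently), the hypothetical helper `match_to_case_week` below pairs each hospitalization week with a case report week. The function and variable names are illustrative and not taken from the repository.

```r
# Hypothetical helper: for each hospitalization week start, return the case
# report week start with the smallest absolute difference in days.
match_to_case_week <- function(hosp_starts, case_starts) {
  idx <- vapply(
    seq_along(hosp_starts),
    function(i) which.min(abs(as.numeric(case_starts - hosp_starts[i]))),
    integer(1)
  )
  case_starts[idx]
}

# Case report weeks start on Thursdays (2023 reference dates)
case_starts <- seq(as.Date("2023-01-26"), by = "week", length.out = 4)

# Hospitalization weeks start on Sundays
hosp_starts <- as.Date(c("2023-01-29", "2023-02-05"))

matched <- match_to_case_week(hosp_starts, case_starts)
print(data.frame(hosp_week = hosp_starts, case_week = matched))
```

With these inputs, the hospitalization week starting Sunday, February 5 is paired with the case week starting Thursday, February 2, as in the example above.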
This repository is essentially a series of GNU Make targets (see `makefile`), which depend on the results of HTTP requests, as well as data sources in `data-sources/`. None of the "cleaned" data are committed to the repository; you need to run `make` yourself to produce them.
First, install Git LFS.
Then, clone the repository and initialize the Git submodules, which track some of our external data sources:

    git clone https://github.com/covidestim/covidestim-sources && cd covidestim-sources
    git submodule init
    git submodule update --remote # This will take 5-30 minutes
Then, make sure you have the necessary R packages installed. These are:
- `tidyverse`
- `cli`
- `docopt`
- `sf`
- `vaccineAdjust`: Not on CRAN. To install: `devtools::install_github("covidestim/vaccineAdjust")`
- `raster`
- `spdep`

You can install the CRAN packages from the R console: `install.packages(c('tidyverse', 'cli', 'docopt', 'sf', 'raster', 'spdep'))`.
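
To check which of these packages are still missing before running any targets, something like the following can be run in the R console. This is only a convenience sketch; the variable names are illustrative and not part of the repository.

```r
# Packages listed above; vaccineAdjust comes from GitHub rather than CRAN
required <- c("tidyverse", "cli", "docopt", "sf", "vaccineAdjust", "raster", "spdep")

# requireNamespace() returns FALSE (instead of raising an error) when a
# package is not installed
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```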
Finally, attempt to make the most important targets. Note that you will need GNU Make >=4.3 installed, which does not ship with OS X.

    # Make all primary outcomes
    make -Bj data-products/{case-death-rr-boost-hosp.csv,case-death-rr-boost-hosp-state.csv}
- `makefile`: The project makefile. If you're confused about how a piece of data gets cleaned, go here first. If you've never read a Makefile before, it's advisable to read an introduction to Make first.
- `data-products/`: All cleaned data is written to this directory. Some recipes will also produce metadata, which will always have a `.json` extension.
- `data-sources/`: All git submodules are stored here, as well as static files used in recipes, like population sizes, polygons, and records of periods of nonreporting.
- `example-output/`: Some example cleaned data, for reference.
- `R/`: All data cleaning scripts live here.
Data sources will not automatically update, and thus `make` will not normally do anything if you attempt to remake a target! This is undesirable if you believe there may be newer versions of data sources available. To pull new data from sources backed by submodules, run:

    git submodule update --remote

And use `make -B` to force targets to be remade, like so:

    make -B data-products/case-death-rr.csv
All data sources for the cleaned data are either:
- Committed to the repository in `data-sources/`
- Committed to the repository in `data-sources/`, but backed by Git LFS
- Accessed through HTTP requests in the `makefile` or from within R scripts
- Committed to the repository as Git submodules in `data-sources/`
Data | Used for | Accessed through | Frequency of update |
---|---|---|---|
Johns Hopkins CSSE | Cases, Deaths | Submodule, `data-sources/jhu-data` | >Daily, until February 14, 2023 |
Covid Tracking Project | Cases, Deaths (2021-02 - 2021-06) | HTTP, api.covidtracking.com | No longer updated |
NYTimes covid data | Nothing, but a future merge will use it to supplement counties missing from the JHU dataset | Submodule, `data-sources/nyt-data` | >Daily |
USCB county, state population estimates | Everything | `data-sources/{fips,state}pop.csv`, reformatted from `.xls` | 1/yr? |
DHHS facility-level hospitalizations data | Hospitalizations | HTTP, healthdata.gov/api | 1/wk |
Dartmouth Atlas Zip-HSA-HRR crosswalk | Hospitalizations (agg/disagg) | `data-sources/ZipHsaHrr18.csv`, downloaded from dartmouthatlas.org | 1/yr? |
Dartmouth Atlas HSA polygons | Hospitalizations (agg/disagg) | `data-sources/hsa-polygons/` (Git LFS), downloaded from dartmouthatlas.org | 1/yr? |
Census Block Group polygons | Hospitalizations (agg/disagg) | `data-sources/cbg-polygons/` (Git LFS), downloaded from TIGER | 1/yr? |
Census Block Group popsize | Hospitalizations (agg/disagg) | `data-sources/population_by_cbg.csv/`, extracted from TIGER | 1/yr? |
Vaccination adjustments | Vaccination adjustments | `data-sources/vaccines-counties.csv` | Static w/ data through Dec 2022 |
Data.CDC.gov | Booster and vaccination data | `data-sources/cdc-vax-boost-state.csv/`, extracted from `cdc-states`, and `data-sources/cdc-vax-boost-county.csv`, extracted from `cdc-counties` | Daily |
Data.CDC.gov | Case data, after February 14, 2023 | `cdc-state-cases` and `cdc-counties-cases` | Weekly |
Some data sources are tracked as git submodules, and other sources, being accessed through APIs, are fetched over HTTP. Large static files are stored through Git LFS.
The included `makefile` provides a common interface for directing the fetching and cleaning of all data sources.
`data-sources/vaccines-counties`: A cached file containing daily RR adjustments to the transition probabilities for each state and county, up until December 31, 2022. This file is rendered by running `vaccineAdjust::run()` with data up until December 1, 2021, and freezing the final RR adjustment. After December 1, 2021, vaccination adjustments occur within the covidestim model.
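
The "freezing" step can be pictured as carrying the last computed RR forward, unchanged, to the end of the file. The sketch below illustrates that idea only: the RR values are made up, the column names `date` and `RR` are assumptions, and `freeze_final` is a hypothetical helper rather than part of `vaccineAdjust`.

```r
# Hypothetical daily RR series for one location, ending on the last date
# for which an adjustment was computed (values are made up)
rr <- data.frame(
  date = seq(as.Date("2021-11-29"), as.Date("2021-12-01"), by = "day"),
  RR   = c(0.62, 0.61, 0.60)
)

# Extend the series to `end_date` by repeating the final RR value.
# Assumes `end_date` is later than the last date already present in `d`.
freeze_final <- function(d, end_date) {
  last <- d[which.max(d$date), ]
  extra_dates <- seq(last$date + 1, end_date, by = "day")
  rbind(d, data.frame(date = extra_dates, RR = last$RR))
}

extended <- freeze_final(rr, as.Date("2022-12-31"))
print(tail(extended, 3))
```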
- `make data-products/case-death-rr-boost-hosp.csv`

  Reads cleaned archived JHU county-level case/death data. Joins by location and date with static vaccine IFR adjustments from `data-sources/vaccines-counties` and the cleaned CDC `vax-boost-county.csv`. Weekly aggregates are joined with the cleaned `hhs-hospitalizations-by-county.csv`. Any metadata for counties included in the cleaned data will be stored in `case-death-rr-boost-hosp-metadata.json`. Joins the weekly CDC case data from `data-sources/cdc-cases-raw.csv`.
- `make data-products/case-death-rr-boost-hosp-state.csv`

  Reads cleaned archived JHU state-level data, splicing in archived Covid Tracking Project data. For details on this, see the `makefile`. Joins by location and date with static vaccine IFR adjustments from `data-sources/vaccines-counties` and the cleaned CDC `vax-boost-state.csv`. Weekly aggregates are joined with the cleaned `hhs-hospitalizations-by-state.csv`. Any metadata for states included in the cleaned data will be stored in `case-death-rr-boost-hosp-state-metadata.json`. Joins the weekly CDC case data from `data-sources/cdc-cases-state-raw.csv`.
- `make data-products/nyt-counties.csv`

  Cleans NYT county-level case/death data. Writes `nyt-counties-rejects.csv`.
- `make data-products/hhs-hospitalizations-by-state.csv`

  State-level aggregation of county-level hospitalization data.
- `make data-products/hhs-hospitalizations-by-county.csv`

  County-level aggregation of facility-level hospitalization data. See the next section for details on how this is done.
- `make data-products/hhs-hospitalizations-by-facility.csv`

  Cleans facility-level data from DHHS's API and annotates each facility with an HSA id. Also computes `.min` and `.max` columns to compensate for censoring done when there are 1-3 hospitalizations in a given week.
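
The "joins by location and date" mentioned in the targets above can be pictured with a small base R example. This is a sketch under assumptions: the column names (`fips`, `date`, `cases`, `deaths`, `RR`), the FIPS codes, and the choice of a left join are illustrative, and the actual scripts in `R/` may do this differently.

```r
# Hypothetical cleaned case/death data (one row per county per date)
cases <- data.frame(
  fips   = c("00001", "00001", "00002"),
  date   = as.Date(c("2022-01-01", "2022-01-02", "2022-01-01")),
  cases  = c(10, 12, 7),
  deaths = c(0, 1, 0)
)

# Hypothetical vaccine adjustments, available for only one of the counties
rr <- data.frame(
  fips = c("00001", "00001"),
  date = as.Date(c("2022-01-01", "2022-01-02")),
  RR   = c(0.6, 0.6)
)

# Left join on location and date: every case/death row is kept, and RR is
# NA wherever no adjustment exists for that county and date
joined <- merge(cases, rr, by = c("fips", "date"), all.x = TRUE)
print(joined)
```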
Key document: COVID-19 Guidance for Hospital Reporting and FAQs For Hospitals, Hospital Laboratory, and Acute Care Facility Data Reporting, January 6, 2022 revision
We source hospitalizations data from the official HHS facility-level dataset. In order to be useful to our model, we transform these data into a county-level dataset.
Outcomes are presented across 3-4 variables. For an outcome `name`, the following variables may be present:
- `{name}`: The outcome itself, including censored data. This means that if a facility reports `-999999` for that week, the `name` outcome will be equal to `-999999`.
- `{name}_min`: The smallest the outcome could be, where all censored values, which each represent a possible range of 1-3, are resolved to 1.
- `{name}_max`: The largest the outcome could be if all censored values are resolved to 3.
- `{name}_max2`: The largest the outcome could be if all censored values are resolved to 3 and any missing days are imputed using the average of the present days.

  Note: This quantity is not meaningful for the following averaged prevalence outcomes, because they are already averaged across the number of days reported by the facility that week (which is not necessarily 7). For these outcomes, `{name}_max == {name}_max2`.
  - `averageAdultICUPatientsConfirmed`
  - `averageAdultICUPatientsConfirmedSuspected`
  - `averageAdultInpatientsConfirmed`
  - `averageAdultInpatientsConfirmedSuspected`
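
The relationship between these variables can be sketched for a single facility-level outcome as follows. This is an illustration under assumptions, not the repository's implementation: the helper `bounds`, the `days_reported` input, and the made-up admissions counts are hypothetical, and imputing missing days "using the average of the present days" is interpreted here as scaling the weekly total by `7 / days_reported`.

```r
CENSORED <- -999999  # value reported when the true count is between 1 and 3

# Hypothetical helper computing the min/max/max2 variants of one outcome
bounds <- function(value, days_reported, averaged = FALSE) {
  lo <- ifelse(value == CENSORED, 1, value)  # censored values resolved to 1
  hi <- ifelse(value == CENSORED, 3, value)  # censored values resolved to 3
  # Averaged prevalence outcomes are not rescaled, so max2 equals max
  hi2 <- if (averaged) hi else hi * 7 / days_reported
  data.frame(value = value, min = lo, max = hi, max2 = hi2)
}

# Three made-up facility-weeks: complete, censored, and only 5 days reported
admissions    <- c(12, CENSORED, 20)
days_reported <- c(7, 7, 5)

b <- bounds(admissions, days_reported)
print(b)
```

In the third row, 20 admissions over 5 reported days average 4 per day, so filling in the 2 missing days gives a `max2` of 28.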
Variable | Meaning | `min`/`max` | `max2` |
---|---|---|---|
`fips` | FIPS code of the county | | |
`weekstart` | YYYY-MM-DD of the first date in the week | | |
`admissionsAdultsConfirmed` | # admissions of adults with confirmed<sup>1</sup> Covid | ✓ | ✓ |
`admissionsAdultsSuspected` | # admissions of adults with suspected<sup>2</sup> Covid | ✓ | ✓ |
`admissionsPedsConfirmed` | # admissions of peds with confirmed<sup>1</sup> Covid | ✓ | ✓ |
`admissionsPedsSuspected` | # admissions of peds with suspected<sup>2</sup> Covid | ✓ | ✓ |
`averageAdultICUPatientsConfirmed` | Average number of ICU beds occupied by adults with confirmed<sup>1</sup> Covid that week | ✓ | equal to `max` |
`averageAdultICUPatientsConfirmedSuspected` | Average number of ICU beds occupied by adults with confirmed<sup>1</sup> or suspected<sup>2</sup> Covid that week | ✓ | equal to `max` |
`averageAdultInpatientsConfirmed` | Average number of inpatient beds occupied by adults with confirmed<sup>1</sup> Covid that week | ✓ | equal to `max` |
`averageAdultInpatientsConfirmedSuspected` | Average number of inpatient beds occupied by adults with confirmed<sup>1</sup> or suspected<sup>2</sup> Covid that week | ✓ | equal to `max` |
`covidRelatedEDVisits` | Total number of ED visits that week related to Covid<sup>3</sup> | ✓ |
County boundaries, HSA boundaries, CBG boundaries
Simply identifying which county each facility lies within and then summing across all facilities in a county carries the following drawbacks:
- It ignores the fact that the hospital may be treating patients from other counties.
- It ignores the fact that patients from one county may be treated at a hospital in an adjacent (or even non-adjacent) county.

These two issues will cause particularly large problems (biases) when:
- There are major medical centers in the area, which are more likely to take the lion's share of severe patients during times of peak Covid prevalence.
- There are small or sparsely-populated counties, where residents may have to travel outside of their county to seek hospital care.
To help solve this problem, we rely on a dataset maintained by the Dartmouth Atlas which defines geographic units called Hospital Service Areas (HSA). These service areas are meant to represent a notion of a catchment area for each hospital:
> Hospital service areas (HSAs) are local health care markets for hospital care. An HSA is a collection of ZIP codes whose residents receive most of their hospitalizations from the hospitals in that area. HSAs were defined by assigning ZIP codes to the hospital area where the greatest proportion of their Medicare residents were hospitalized. Minor adjustments were made to ensure geographic contiguity. Most hospital service areas contain only one hospital. The process resulted in 3,436 HSAs.
Importantly, HSAs are only an approximation of a catchment area. Since a patient may travel outside of the "catchment area" for care with some nonzero probability, the concept of a catchment area as "patient always goes to a hospital in this polygon" has inherent limitations in capturing patient facility choice. See [this 2015 paper][paper] by Kilaru and Carr for a discussion of these problems.

Nonetheless, we use HSAs as our aggregate geographical unit because we believe that they are a better representation of a catchment area than what we would get by simply drawing the county border enclosing each facility. To leverage these HSAs to create county-level hospitalizations data, we:
- Aggregate facility-level data to the HSA level
- Fracture the HSAs using county boundaries
- Use CBG population data to divide the outcomes from fractured HSAs into the intersecting counties in a population-proportional manner. Note that the CBG population is not present for D.C. in our dataset, so we manually add the D.C. population for the corresponding HSA.

[paper]: https://pubmed.ncbi.nlm.nih.gov/25961661/
[da]: https://www.dartmouthatlas.org/faq/
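
The three steps above can be sketched with plain data frames. This example assumes the spatial work (intersecting HSA polygons with county boundaries and summing CBG populations within each resulting fragment) has already been done, so it starts from a hypothetical `fragments` table. All names, FIPS codes, and numbers are made up for illustration.

```r
# Hypothetical facility-level outcome, already annotated with an HSA id
facilities <- data.frame(
  hsa        = c("A", "A", "B"),
  admissions = c(30, 10, 20)
)

# Hypothetical population of each HSA-county fragment (summed from CBGs).
# HSA "A" straddles two counties; HSA "B" lies entirely within one.
fragments <- data.frame(
  hsa  = c("A", "A", "B"),
  fips = c("00001", "00002", "00002"),
  pop  = c(3000, 1000, 5000)
)

# Step 1: aggregate facility-level data to the HSA level
hsa_totals <- aggregate(admissions ~ hsa, data = facilities, FUN = sum)

# Steps 2-3: each fragment receives a share of its HSA's total that is
# proportional to the fragment's population
fragments$share <- fragments$pop / ave(fragments$pop, fragments$hsa, FUN = sum)
alloc <- merge(fragments, hsa_totals, by = "hsa")
alloc$admissions <- alloc$admissions * alloc$share

# Sum the allocated fragments to obtain county-level outcomes
county <- aggregate(admissions ~ fips, data = alloc, FUN = sum)
print(county)
```

Because the shares within each HSA sum to 1, the county-level totals add up to the same overall count as the facility-level data.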
Footnotes
1. Definition of "Laboratory-confirmed Covid": Source: Page 44, HHS hospital reporting guidance
2. Definition of "suspected Covid": "“Suspected” is defined as a person who is being managed as though he/she has COVID-19 because of signs and symptoms suggestive of COVID-19 but does not have a laboratory-positive COVID-19 test result." Source: Page 14, HHS hospital reporting guidance
3. Definition of "related to Covid": "Enter the total number of ED visits who were seen on the previous calendar day who had a visit related to suspected or laboratory-confirmed COVID-19. Do not count patients who receive a COVID-19 test solely for screening purposes in the absence of COVID-19 symptoms." Source: Page 14, HHS hospital reporting guidance