Tremendous number of data anomalies - be careful presenting this data. #14

Closed
oisact opened this issue Dec 14, 2020 · 8 comments
@oisact

oisact commented Dec 14, 2020

I'm attempting to make use of this data to provide communities with information about their hospital capacity and COVID rates. However, this dataset is absolutely rife with anomalies. I cannot find one pair of metrics that I can compare across the dataset that does not result in nonsense data.

More specifically, data that is supposed to be a subset of some other data is often GREATER than the thing it is supposed to be a subset of. For example, there are over 1,000 records in which all_adult_hospital_inpatient_beds_7_day (3-b below) is greater than inpatient_beds_7_day (3-a). Since adult inpatient beds are a subset of inpatient beds, that should not be possible. In some cases it is a rounding-type error (off by ±1 bed), and in other cases the adult inpatient beds were specified but the inpatient beds were not. In most cases, however, the data simply does not make sense (e.g. 159 adult inpatient beds vs. 132.7 total inpatient beds). In those cases I looked at the other values to see whether the facility was full beyond capacity, or had a high number of COVID patients, but that was not the case.

Note that those values are not even census figures (i.e. actual patient counts) but raw BED counts, regardless of whether or not the beds are occupied. Why are hospitals reporting data as nonsensical as that (we physically have more adult inpatient beds in the hospital than we have beds in total)?

From the document provided to hospitals providing guidance on reporting data (https://www.hhs.gov/sites/default/files/covid-19-faqs-hospitals-hospital-laboratory-acute-care-facility-data-reporting.pdf):

2-a) All hospital beds
Total number of all staffed inpatient and outpatient beds in your hospital, including all overflow, observation, and active surge/expansion beds used for inpatients and for outpatients (includes all ICU, ED, and observation).
Subset:
2-b) All adult hospital beds
Total number of all staffed inpatient and outpatient adult beds in your hospital, including all overflow and active surge/expansion beds for inpatients and for outpatients (includes all ICU, ED, and observation)

3-a) All hospital inpatient beds
Total number of staffed inpatient beds in your hospital including all overflow, observation, and active surge/expansion beds used for inpatients (includes all ICU beds). This is a subset of #2.
Subset:
3-b) Adult hospital inpatient beds
Total number of staffed inpatient adult beds in your hospital including all overflow, observation, and active surge/expansion beds used for inpatients (includes all designated ICU beds). This is also a subset of #2

There are thousands of records for every one of those subsets where the subset is greater than the parent set(s).

It is apparent that there is zero data validation during data entry (i.e. no rule like "Your total adult inpatient bed count cannot be greater than the total inpatient bed count for your facility"), and that there is serious confusion and varied interpretation among the staff providing this data as to what the fields are supposed to mean. I have a hunch that some facilities are reporting the total daily inpatients as the total daily bed occupancy, which is NOT the same thing. A single bed can be occupied by more than one patient in a day, even if other beds or entire floors of the facility are unused. That drastically inflates how full the hospital appears, since those patients were not all occupying beds at the same time.

I don't see how this data can be used without either vetting it manually per record by a person (not realistic for those of us that consume this data), or simply passing all these obvious inaccuracies along to the general public in our data presentation (garbage in, garbage out).
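For anyone consuming this dataset, the subset checks described above can be sketched in plain Python. The column names are from the dataset; the 1-bed tolerance and the sample rows are assumptions for illustration:

```python
# Sketch of the subset-consistency check described above: flag rows where a
# metric that is supposed to be a subset (e.g. adult inpatient beds) exceeds
# its parent metric (e.g. all inpatient beds). A tolerance of 1 bed absorbs
# the rounding-type errors mentioned above.
def subset_violations(rows, subset_col, parent_col, tolerance=1.0):
    violations = []
    for row in rows:
        subset, parent = row.get(subset_col), row.get(parent_col)
        if subset is None or parent is None:
            continue  # one of the two metrics was not reported
        if parent > 0 and subset - parent > tolerance:
            violations.append(row)
    return violations

# Hypothetical sample rows; the first mirrors the "159 adult inpatient beds
# vs 132.7 total inpatient beds" example above.
sample = [
    {"all_adult_hospital_inpatient_beds_7_day": 159.0, "inpatient_beds_7_day": 132.7},
    {"all_adult_hospital_inpatient_beds_7_day": 180.0, "inpatient_beds_7_day": 200.0},
]
bad = subset_violations(sample, "all_adult_hospital_inpatient_beds_7_day",
                        "inpatient_beds_7_day")
```

The same function works for any parent/subset column pair in the guidance document (2-a/2-b, 3-a/3-b, and so on).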

@ftrotter
Contributor

There are a lot of issues here (I will try to break them up and answer them as I have time), but it would be most helpful if you could provide the CCNs that show the specific problems you mention.

@oisact
Author

oisact commented Dec 14, 2020

Sure. Attached is a CSV with 390 records where the inpatient_beds_used_7_day_avg is more than 2 greater than the inpatient_beds_7_day_avg and the inpatient_beds_7_day_avg is greater than 0.
inpatient_beds_used_greater_than_capacity.txt

The "row" column references the row in the CSV I have, which is up to collection week 2020-11-27.

@oisact
Author

oisact commented Dec 14, 2020

Attached is CSV with 499 records where all_adult_hospital_inpatient_bed_occupied_7_day is more than 2 greater than all_adult_hospital_inpatient_beds_7_day and all_adult_hospital_inpatient_beds_7_day is greater than 0.
adult_inpatient_beds_used_greater_than_capacity.txt

@oisact
Author

oisact commented Dec 14, 2020

Attached is a CSV with 678 records where all_adult_hospital_inpatient_beds_7_day is more than 2 greater than inpatient_beds_7_day and inpatient_beds_7_day is greater than 0. In other words the hospital has more adult inpatient beds than they have inpatient beds total.
inpatient_bed_capacity_greater_than_capacity.txt

@oisact
Author

oisact commented Dec 14, 2020

Here are 67 records reporting more previous_day_covid_ED_visits_7_day_sum patients than previous_day_total_ED_visits_7_day_sum where previous_day_total_ED_visits_7_day_sum is greater than 0.
ed_covid_greater_than_visits.txt

@oisact
Author

oisact commented Dec 14, 2020

129 records where staffed_icu_adult_patients_confirmed_and_suspected_covid_7_day is more than 2 greater than icu_beds_used_7_day and icu_beds_used_7_day is greater than 0.
icu_covid_greater_than_visits.txt

@daveluo

daveluo commented Dec 14, 2020

Hi @oisact, thanks for sharing the exact ccns and your findings so far. I did the same comparison independently and found similar numbers as you (and crosschecked against yours). I agree that these are great data sanity checks to run the dataset through to find anomalous reporting by individual facilities.

However, I think you're overstating the magnitude of the problem by leading off with "Tremendous number of data anomalies" and concluding "I don't see how this data can be used with[out] being vetted manually per-record by a person (not realistic for those of us that consume this data), or simply passing all these obvious inaccuracies along to the general public in our data presentation (garbage in, garbage out)."

That is hundreds of anomalous rows in a dataset that has over 87k rows (and counting, since it grows with every week's update; i.e. the version just out today has over 92k rows). From this perspective the anomaly rate is under 1%.

Digging in further with your sanity filter that captures the most rows ("all_adult_hospital_inpatient_beds_7_day is more than 2 greater than inpatient_beds_7_day and inpatient_beds_7_day is greater than 0", against the same _20201207.csv data release), I see that many of the anomalous rows are being reported by the same hospitals week after week, which suggests that those particular facilities may need outreach to correct, or at least to better understand, how they are regularly reporting these metrics.

Or, in many cases, it's a one-time (as in one collection_week) reporting error by a facility. That is also quite plausible given that humans are doing their best to report 80-something data fields every day of the week, sometimes while also managing the highest hospitalization volumes they've ever seen.

I end up with 199 unique facilities by my count (198 unique facilities in your file of 678 rows; the difference is ccn==101312). Out of roughly 4,900 unique facilities represented in this dataset, that's about a 4% facility-level anomaly rate.
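The distinction between the row-level and facility-level rates can be sketched like this (the anomalous rows below are invented for illustration; the 92k/4,900 totals are the approximate figures mentioned in this thread):

```python
# Compute the share of anomalous rows vs. the share of facilities (unique
# CCNs) that ever produced an anomalous row. The same few facilities
# repeating an error inflates the row count but not the facility count.
def anomaly_rates(anomalous_rows, total_rows, total_facilities):
    unique_ccns = {row["ccn"] for row in anomalous_rows}
    return (len(anomalous_rows) / total_rows,      # row-level rate
            len(unique_ccns) / total_facilities)   # facility-level rate

# Hypothetical anomalous rows: two facilities, one of them repeating.
anomalous = [{"ccn": "161381"}, {"ccn": "100002"}, {"ccn": "100002"}]
row_rate, facility_rate = anomaly_rates(anomalous,
                                        total_rows=92000,
                                        total_facilities=4900)
```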

For example, here are the top 10 rows with the highest bed difference:

| collection_week | hospital_pk | ccn | hospital_name | city | state | all_adult_hospital_inpatient_beds_7_day_avg | inpatient_beds_7_day_avg | bed_diff |
|---|---|---|---|---|---|---|---|---|
| 2020-11-06 | 161381 | 161381 | SANFORD SHELDON MEDICAL CENTER | SHELDON | IA | 683.6 | 45.0 | 638.6 |
| 2020-07-31 | 100105 | 100105 | CLEVELAND CLINIC INDIAN RIVER HOSPITAL | VERO BEACH | FL | 1863.9 | 1243.0 | 620.9 |
| 2020-07-31 | 490063 | 490063 | INOVA FAIRFAX HOSPITAL | FALLS CHURCH | VA | 687.0 | 300.0 | 387.0 |
| 2020-08-28 | 100002 | 100002 | BETHESDA HOSPITAL EAST | BOYNTON BEACH | FL | 956.4 | 571.0 | 385.4 |
| 2020-07-31 | 100002 | 100002 | BETHESDA HOSPITAL EAST | BOYNTON BEACH | FL | 955.7 | 571.0 | 384.7 |
| 2020-10-02 | 100002 | 100002 | BETHESDA HOSPITAL EAST | BOYNTON BEACH | FL | 955.7 | 571.0 | 384.7 |
| 2020-08-21 | 100002 | 100002 | BETHESDA HOSPITAL EAST | BOYNTON BEACH | FL | 955.4 | 571.0 | 384.4 |
| 2020-09-25 | 100002 | 100002 | BETHESDA HOSPITAL EAST | BOYNTON BEACH | FL | 939.1 | 555.3 | 383.8 |
| 2020-09-18 | 100002 | 100002 | BETHESDA HOSPITAL EAST | BOYNTON BEACH | FL | 954.6 | 571.0 | 383.6 |
| 2020-10-09 | 100002 | 100002 | BETHESDA HOSPITAL EAST | BOYNTON BEACH | FL | 954.3 | 571.0 | 383.3 |

And the top 30 unique facilities by number of anomalous rows reported:
[Screenshot: table of the top 30 facilities ranked by number of anomalous rows]

Scatter-plotting all facilities with a bed difference >0, visualized separately by collection_week, we also see that most of the anomalies occur in the first part of the time series. The count drops after the first week (2020-07-31) and is significantly reduced from 2020-10-30 onwards. This also makes sense given that the HHS Protect system was adopted by hospitals for reporting in mid-July and has a learning curve to report correctly (also while many of these places were in the midst of surging COVID patient volumes in mid/late summer):

[Scatter plot: per-facility bed differences by collection_week, log scale]
(Plotting on log scale to better visualize the smaller facilities in which most of these anomalies are occurring)
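The per-week tally behind that plot can be reproduced with a simple counter (the sample weeks below are made up for illustration):

```python
from collections import Counter

# Count anomalous rows per collection_week to check whether anomalies
# decline over time, as described above.
def anomalies_by_week(rows):
    return Counter(row["collection_week"] for row in rows)

# Hypothetical anomalous rows spanning two reporting weeks.
sample = [
    {"collection_week": "2020-07-31"},
    {"collection_week": "2020-07-31"},
    {"collection_week": "2020-10-30"},
]
weekly = anomalies_by_week(sample)
```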

I agree with you that it's likely that many of these errors are because "there is serious confusion and varied interpretation by the staff providing this data as to what the data is supposed to mean". And that these anomalous facilities & rows should be set aside for further inquiry and potential correction, especially the ones with the biggest deltas and absolute value which will distort analysis.

But your conclusion that this is "rife with anomalies" is disproportionate to the magnitude of the potential data errors highlighted here. I think we risk throwing out the baby with the bath water when we dismiss the whole dataset because of a very small number of errors that may seem glaring at first. I've never encountered a perfectly clean dataset that's compiled from thousands of independently reporting entities every day.

I think HHS knows this too, as written about in their release blogpost:
"We opted not to have perfect be the enemy of good, so these datasets will have imperfections. To continue improving the quality of data, we welcome your feedback. When more people access and use the data, we have more collective ability to identify gaps, errors, or other problems with these COVID-19 datasets."

It sounds like they really welcome careful, data-savvy folks like you and me combing through the data and helping them and the hospitals highlight and correct reporting issues, all toward an ever-improving dataset and a clearer, better understanding of where our hospitals and healthcare workers are being most impacted by COVID-19.

@oisact
Author

oisact commented Dec 14, 2020

Thanks for your detailed response. I do want to point out that the data that are obviously wrong to us would only be the tip of the iceberg. For example with patient census counts, it is only obvious the patient count is wrong because it significantly exceeds the number of beds. Statistically, there must be even more hospitals who are similarly reporting these numbers wrong, but because their census numbers are lower, the patient count does not exceed the bed count. My local hospital is only at 35% capacity, thus there is a very, very large margin for incorrect patient count reporting (nearly 300% error) before they would exceed the total bed count and we would be aware there is an issue.

Looking at my local hospital, for which I am quite familiar, I see other data issues.

All hospital beds
Total number of all staffed inpatient and outpatient beds in your hospital, including all overflow, observation, and active surge/expansion beds used for inpatients and for outpatients (includes all ICU, ED, and observation).
All hospital inpatient beds
Total number of staffed inpatient beds in your hospital including all overflow, observation, and active surge/expansion beds used for inpatients (includes all ICU beds). This is a subset of #2.

The first metric, All hospital beds, includes outpatient (pre-surgery and post-surgery beds) and ED beds, while the second metric, All hospital inpatient beds, should not include either of those counts. Thus no hospital that has an ED, or that has outpatient surgery, will have the same value for total_beds_7_day_avg and inpatient_beds_7_day_avg. My local hospital has the same count for total beds, inpatient beds and adult inpatient beds, which I know is wrong because it has both an ED and surgery (my wife works at the facility, so I'm trying to validate this data against the actual facility). In the database I show 2,461 distinct CCNs for which those two bed counts are equal (28,876 records). I'm sure that there is some small percentage of facilities that have neither ED nor surgery, but it would be a very small percentage of the hospitals.
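Under the same assumptions (the `_7_day_avg` column names are from the dataset; the sample values are invented), flagging facilities whose total, inpatient, and adult-inpatient bed counts are all identical could look like:

```python
# Flag rows where total beds, inpatient beds, and adult inpatient beds are
# all equal, which is implausible for any facility with an ED or outpatient
# surgery, per the definitions quoted above. A small tolerance absorbs
# floating-point noise in the 7-day averages.
def equal_bed_counts(rows, tol=0.01):
    flagged = []
    for row in rows:
        total = row.get("total_beds_7_day_avg")
        inpatient = row.get("inpatient_beds_7_day_avg")
        adult = row.get("all_adult_hospital_inpatient_beds_7_day_avg")
        if None in (total, inpatient, adult):
            continue  # incomplete reporting; cannot compare
        if abs(total - inpatient) <= tol and abs(inpatient - adult) <= tol:
            flagged.append(row)
    return flagged

# Hypothetical rows: the first has all three counts equal, the second does not.
sample = [
    {"total_beds_7_day_avg": 100.0, "inpatient_beds_7_day_avg": 100.0,
     "all_adult_hospital_inpatient_beds_7_day_avg": 100.0},
    {"total_beds_7_day_avg": 150.0, "inpatient_beds_7_day_avg": 100.0,
     "all_adult_hospital_inpatient_beds_7_day_avg": 90.0},
]
flagged = equal_bed_counts(sample)
```

Grouping the flagged rows by CCN would reproduce the 2,461-facility figure above against the full dataset.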

I want to use this data to bring awareness to the users of our apps that hospitals are seeing significant numbers of COVID patients. However, I will not present data that does not represent what I say it represents ("The hospital is operating at 90% inpatient capacity" when it isn't, or "143% of ICU patients have COVID", etc.). I can see how this data is still useful for research purposes: studying trends, looking at overall averages at the state or country level. In my case, however, I will be breaking this data down to individual hospitals and counties, and at that level these data issues would be glaringly inaccurate for too many facilities.

@ftrotter ftrotter closed this as completed Dec 8, 2021