See Election2_dataset.ipynb, final.csv & covidRegressionandRF.ipynb
Members: Jimmy Greer, Ben Altshuler, TeKisha Sampson & Jason Goddard
Comm Protocol:
- Trello for PM, Slack, Group-Text, email
- Use one of the above as heads-up on working status, new branches, major commits, new files, pull requests, and merge conflicts
Objective: Did counties with mask mandates see fewer COVID-19 cases than counties without them? Can we find other, more relevant features that relate to the death rate? The analysis is based on cases as a percentage of each county's population.
Intro Deck: Red Zone's COVID-19 Mask Mandate Intro
CSV | Keys | Summary |
---|---|---|
POPULATION_TEST.csv | FIPS | population in 2020 by county in the United States |
us-counties.csv | FIPS | cumulative 2020 cases and deaths by county, as reported on the last day of the year |
county_mask_mandate_data.csv | county_fips, state_fips | whether each county in the United States implemented a mask mandate |
elec_results_2020.csv | state_fips | designation of red/blue states by 2020 presidential election |
Pandas, Matplotlib, Sklearn, PostgreSQL 13.x & Numpy
Most source data is in CSV format. Pandas reads each source into a separate dataframe, which is then cleaned and merged. The merge of county population takes place in SQL on a Postgres instance started by the user. Finally, a complete CSV is passed to the ML segment.
- County Mask Mandate
  - dropping multiple columns
  - county_start_date to 1 or 0 in new column
  - add column for duration
- US Counties
  - dropping multiple columns
  - groupby counties to get sum of cases and deaths
- Population Test
  - concatenate state and county codes into fips
- Election Results
  - ETL strip
  - merge by category into County Mask Mandate data on state_fips
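The cleaning steps listed above can be sketched in Pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy rows are invented, and the column names `mask_mandate`, `mandate_days`, `STATE`, and `COUNTY` are assumptions (only `county_start_date`, `county_fips`, and `state_fips` appear in the source). The duration is assumed to run from the start date to the end of 2020.

```python
import pandas as pd

# Toy rows standing in for county_mask_mandate_data.csv (values invented).
mandates = pd.DataFrame({
    "county_fips": ["01001", "01003", "01005"],
    "state_fips": ["01", "01", "01"],
    "county_start_date": ["2020-07-16", None, "2020-10-01"],
})
start = pd.to_datetime(mandates["county_start_date"])
# 1 if the county has a mandate start date, otherwise 0.
mandates["mask_mandate"] = start.notna().astype(int)
# Assumption: duration is counted in days up to 2020-12-31; 0 without a mandate.
year_end = pd.Timestamp("2020-12-31")
mandates["mandate_days"] = (year_end - start).dt.days.fillna(0).astype(int)

# Toy rows standing in for us-counties.csv: group by county and sum.
counties = pd.DataFrame({
    "fips": ["01001", "01001", "01003"],
    "cases": [100, 50, 200],
    "deaths": [2, 1, 4],
})
totals = counties.groupby("fips", as_index=False)[["cases", "deaths"]].sum()

# Toy rows standing in for POPULATION_TEST.csv (STATE/COUNTY names are hypothetical).
population = pd.DataFrame({"STATE": [1, 1], "COUNTY": [1, 3]})
# Zero-pad to 2 + 3 digits so the result matches a 5-character FIPS code.
population["FIPS"] = (
    population["STATE"].astype(str).str.zfill(2)
    + population["COUNTY"].astype(str).str.zfill(3)
)
```

Zero-padding matters here because FIPS codes read as integers lose their leading zeros, which would otherwise break the later merges.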
We merge on the FIPS keys to bring in population, so that features can be measured fairly against cases as a percentage of each county's population.
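As a rough sketch of that merge done in Pandas (the project performs the population merge in SQL), assuming hypothetical column names `fips`, `population`, `case_pct`, and `death_pct` and invented toy values:

```python
import pandas as pd

# Toy per-county totals and populations (values invented for illustration).
cases = pd.DataFrame({
    "fips": ["01001", "01003"],
    "cases": [500, 2000],
    "deaths": [10, 40],
})
population = pd.DataFrame({
    "fips": ["01001", "01003", "01005"],
    "population": [10000, 20000, 5000],
})

# Inner join keeps only counties present in both tables;
# validate raises if a FIPS code is duplicated on either side.
merged = cases.merge(population, on="fips", how="inner", validate="one_to_one")

# Express cases and deaths as a percentage of county population.
merged["case_pct"] = 100 * merged["cases"] / merged["population"]
merged["death_pct"] = 100 * merged["deaths"] / merged["population"]
```

Normalizing by population is what makes a small rural county comparable to a large urban one.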
Note: while much of the merging was done in Python, the below shows a simple ERD of the mapping that we could work from throughout.
See covidRegressionandRF.ipynb
Using a classification model, Logistic Regression, we'd like to see if we can predict the likelihood of infection in a county with a mask mandate. We'd like to pinpoint correlation by adding population size & 2020 presidential election results as features. Because of the manageable size of the data, we believe that we can employ Logistic Regression from the start.
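A minimal sketch of this kind of model with scikit-learn is shown below. It runs on synthetic data, so its scores say nothing about the real results; the column names, the bin edges, and the scaling step are all assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for final.csv (all values are generated, not real).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "mask_mandate": rng.integers(0, 2, n),
    "is_blue_state": rng.integers(0, 2, n),
    "population": rng.integers(1_000, 1_000_000, n),
})
df["case_pct"] = 8 - 3 * df["mask_mandate"] + rng.normal(0, 1.5, n)

# Assumption: the target is case percentage cut into bins (edges are hypothetical).
df["case_bin"] = pd.cut(
    df["case_pct"], bins=[-np.inf, 5, 8, np.inf], labels=["low", "mid", "high"]
).astype(str)

X = df[["mask_mandate", "is_blue_state", "population"]]
y = df["case_bin"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features so population does not dominate the binary flags.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(classification_report(y_test, pred))
```

The printed report has the same layout as the classification reports referenced in this README: precision, recall, and f1-score per bin, plus overall accuracy.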
Logistic Regression Classification Reports:
With high f1-scores across the bins and an accuracy of 92.9%, this would be a good predictive model, given that we have the mask mandate, Blue/Red State status, and population of a given county.
Additionally, we ran a Random Forest model. With an f1-score of .60, we would lean towards the Logistic Regression above.
To narrow down the likely driver, we took a simple approach using linear association. The results closely matched what we saw in the data visualization and pointed to the mask mandate as the driving feature in the dataset.
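One simple way to look at linear association is the Pearson correlation between each feature and the case percentage. The sketch below uses that as a stand-in, since the exact method used in the notebook is not stated here; the toy values and column names are invented.

```python
import pandas as pd

# Toy data (invented): four counties with two binary features.
df = pd.DataFrame({
    "mask_mandate": [1, 1, 0, 0],
    "is_blue_state": [1, 0, 1, 0],
    "case_pct": [4.0, 5.0, 8.0, 9.0],
})

# Pearson correlation of every feature with case_pct.
assoc = df.corr()["case_pct"].drop("case_pct")
# The feature with the largest absolute correlation is the strongest linear signal.
strongest = assoc.abs().idxmax()
print(assoc)
print("strongest linear association:", strongest)
```

A correlation like this only describes association in the data; it does not by itself establish that one factor causes the other.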
2020 COVID-19 Analysis. Cases, Deaths & Mandates via Tableau
Consider:
- Filtering Mask Mandates map down to 0-5% as well as 10-29%
- Highlighting Case % and Death % maps by mask mandate
These visualizations tell a story that aligns with the data analysis.
- final.csv
  - via Election2_dataset.ipynb