Forecasting COVID-19 cases for the next 7 days and beyond

Accurate forecasting of COVID-19 cases is critical for epidemiological, economic, and personal coping strategies, which makes it an important challenge for data scientists working on time series analysis and forecasting. We built four kinds of models to predict the number of COVID-19 cases over the next seven days in 23 countries. To maximize the usability of the models, we trained them only on readily available features: Google mobility, weather, vaccination, previous cases, and temporal data (e.g. year, month, day). We compared the performance of the models with cross-validation. We conclude that the neural network model with LSTM layers outperforms the others; however, the XGBoost regressor may be preferable when a faster result with comparable performance is needed.

Motivation

This plot illustrates the dynamic, time-varying patterns of COVID-19 cases (cases per million) in 5 different countries. Our goal is to construct a variety of models that predict future cases from prior cases and other relevant features such as weather, datetime, and mobility in various domains (e.g. parks, groceries, workplaces). We built four different types of models:

SARIMAX
XGBoost regression
Multi-layer perceptron
Long Short-Term Memory networks (LSTM)

Each model was designed to predict cases for the 7 days following a given day. Walk-forward validation for time series data was used to test model performance on unseen data. Because the validation and test datasets were separated from the training dataset by gaps of weeks to months, respectively, we could see how the models perform when predicting cases far ahead into the future.
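The 7-day-ahead setup and walk-forward validation described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names `make_windows` and `walk_forward_splits`, the 14-day lookback, and the `gap` parameter are all assumptions introduced here.

```python
import numpy as np

def make_windows(series, n_lags=14, horizon=7):
    """Turn a 1-D daily series into (X, y) pairs.

    Each row of X holds n_lags consecutive days (assumed lookback);
    the matching row of y holds the following `horizon` days.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lags - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    X = np.stack([series[i:i + n_lags] for i in range(n_samples)])
    y = np.stack([series[i + n_lags:i + n_lags + horizon] for i in range(n_samples)])
    return X, y

def walk_forward_splits(n_samples, n_folds=3, test_size=7, gap=0):
    """Yield (train_idx, test_idx) pairs in chronological order.

    The training set expands with every fold and always ends `gap`
    samples before the test block starts, so no future data is used.
    """
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("not enough samples for this fold configuration")
        yield np.arange(train_end), np.arange(test_start, test_end)
```

A model would be fit on `X[train_idx], y[train_idx]` and scored on `X[test_idx], y[test_idx]` for each fold. Increasing `gap` mimics a longer distance between the training data and the evaluation period.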

Model performance

Once each model was trained, its performance was tested on validation and test sets that the model had not seen during training. The figure above illustrates the actual number of cases (blue) in Ireland and the cases predicted by the LSTM model (orange) on the validation set.

One major motivation of this project is to compare the predictive performance of different models, so we compared how the four models performed on the validation and test sets. The plot above indicates that the LSTM (red) outperformed the other models. The XGBoost regression model (orange) showed comparable performance, so it could be a good alternative at a lower computational cost than the LSTM model.
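A model comparison of this kind could be scripted roughly as below. The README does not state which error metric was used, so root mean squared error (RMSE) is an assumption here, and `rmse` and `rank_models` are hypothetical helper names rather than functions from the repository.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rank_models(y_true, predictions):
    """Score each model's predictions and sort from lowest to highest error.

    predictions: dict mapping a model name to its predicted values
    for the same dates as y_true.
    """
    scores = {name: rmse(y_true, y_pred) for name, y_pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Calling `rank_models` once on the validation set and once on the test set gives two rankings that can be compared side by side.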

Get Started

Download the data:
Data from 23 countries, comprising the features and the target (daily cases per million per country) of our models, are preprocessed, stored in a dictionary, and pickled as 'covid_country_data.pickle'. Please follow this link to download the preprocessed data: link_to_preprocessed_data. It is a handy starting point for building on our code.
To install, clone the repo:

git clone git@github.com:parkjlearning/covid19_forecasting.git
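Once the downloaded 'covid_country_data.pickle' is in the working directory, it could be loaded along these lines. This is a sketch under assumptions: `load_country_data` is a hypothetical helper, and the layout of the dictionary (its keys and the type of its values) is not documented here, so the code only checks that the file unpickles to a dictionary.

```python
import pickle
from pathlib import Path

def load_country_data(path="covid_country_data.pickle"):
    """Load the pickled dictionary of preprocessed data.

    Only unpickle files from a source you trust: pickle can execute
    arbitrary code while loading.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; download the preprocessed data first")
    with path.open("rb") as f:
        data = pickle.load(f)
    if not isinstance(data, dict):
        raise TypeError(f"expected a dict, got {type(data).__name__}")
    return data
```

Printing `list(data.keys())` after loading is a quick way to inspect how the dictionary is organized. Unpickling may also require the libraries that were used to build the stored objects to be installed.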

Additional info.

📄 Please find the final report of this project here: Final report
💻 Please find the final presentation of this project here: Final presentation
