- Project Statement
- Data
- Time series description
- Smoothing Method
- Decomposition Method
- Modelling with a SARIMA model
- Conclusion
The objective of this assignment was to produce reliable forecasts of the last 12 months of the series using different techniques and models.
First, I will analyze the time series, checking for seasonality, trend and stationarity. After that, I will apply smoothing and decomposition methods to produce forecasts.
Finally, I will fit SARIMA models to the time series, which will allow me to make accurate forecasts.
This dataset contains the monthly total U.S. energy-related CO2 emissions, in million metric tons, from January 1973 to January 2021.
We can see an increasing trend from the beginning until 2018 and a decreasing trend from that year onward. In addition, the seasonality present in the data is clearly visible.
The dataset is broken down into several emission sources, such as Coal, Natural Gas and Petroleum, but the series I used is the total over all the sources present in the dataset.
The dataset was downloaded from https://www.eia.gov/todayinenergy/detail.php?id=44837
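As a minimal sketch of the setup in R (the file name co2_monthly.csv and the column name total are assumptions about how the EIA download was exported, not the actual names), the series and the 12-month hold-out used later can be built as follows:

```r
# Hypothetical file/column names: one row per month, total emissions in "total"
raw <- read.csv("co2_monthly.csv")

# Monthly series from January 1973 to January 2021, in million metric tons
co2 <- ts(raw$total, start = c(1973, 1), frequency = 12)

# Hold out the last 12 observations (the forecasting target) as the test set
n         <- length(co2)
co2_train <- ts(co2[1:(n - 12)], start = start(co2), frequency = 12)
co2_test  <- ts(co2[(n - 11):n], end = end(co2), frequency = 12)

plot(co2, ylab = "CO2 emissions (million metric tons)")
```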
We can see different trends over time, the most pronounced one between 1983 and 2008. At the beginning of 2020 there is a sharp drop, most likely due to the start of the Covid-19 pandemic, so I decided to carry out the analysis without the last 12 months of observations. From this analysis, I can conclude that this time series is not stationary.
As can be seen above, the ACF is not meaningful because the time series is not stationary.
The variance of the data seems to change in some parts of the time series. To stabilize it, I used a Box-Cox transformation, taking the log of the data, which can also help remove some of the non-linearity in the trend.
Below, the log-transformed time series can be seen; the transformation does not appear to have a big impact on the series. Using the Box-Cox function in R to check whether it gave a better result than the log-transformed data, I found no major difference.
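A possible version of that comparison, using BoxCox.lambda() from the forecast package on the training series defined above (the exact call used in the report may have differed):

```r
library(forecast)

lco2   <- log(co2_train)              # log transform (Box-Cox with lambda = 0)
lambda <- BoxCox.lambda(co2_train)    # data-driven estimate of lambda

# Compare the log-transformed series with the Box-Cox-transformed one
plot(cbind(log = lco2, boxcox = BoxCox(co2_train, lambda)),
     main = paste("Estimated lambda =", round(lambda, 2)))
```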
In addition, to examine the seasonality present in the data, I computed the two plots below.
Studying the monthly increases of the log-transformed data and their ACF plot, I can state that the monthly increases are stationary around zero and seasonal.
The annual increases also look approximately stationary around a mean of zero.
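A sketch of how these increases and their ACFs can be examined (the plot layout is my choice; lco2 is the logged series from the previous sketch):

```r
dlco2   <- diff(lco2)            # monthly increases of the logged data
d12lco2 <- diff(lco2, lag = 12)  # annual increases of the logged data

par(mfrow = c(2, 2))
plot(dlco2,   main = "Monthly increases (log data)")
acf(dlco2,    lag.max = 48)
plot(d12lco2, main = "Annual increases (log data)")
acf(d12lco2,  lag.max = 48)
par(mfrow = c(1, 1))
```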
Because the seasonal swings do not have the same magnitude every year and seem to depend on the increasing or decreasing trend, I think the most adequate model is Holt-Winters' multiplicative model. Although the observed differences are small, I expect the multiplicative model to be a bit better than the additive one.
Below we can see the results of applying the model and, after that, the plot of the predictions against the observed data, which shows that the fit is generally good. The forecast results are shown below as well.
Checking the accuracy of the forecast, the test-set MAPE is a bit more than 4.06%, which is a good value. Comparing it with the accuracy of the additive model, 3.99%, I get a slightly better forecast with the multiplicative model, as I predicted at the beginning.
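A sketch of the two Holt-Winters fits and their test-set accuracy, using hw() from the forecast package (the actual settings, such as damping, may have differed):

```r
library(forecast)

fit_mul <- hw(co2_train, seasonal = "multiplicative", h = 12)
fit_add <- hw(co2_train, seasonal = "additive",       h = 12)

# Test-set accuracy (MAPE among other measures) against the held-out 12 months
accuracy(fit_mul, co2_test)
accuracy(fit_add, co2_test)

# Forecasts from the multiplicative model together with the observed values
plot(fit_mul)
lines(co2_test, col = "red")
```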
Classical Method
As I concluded before, the best model for this time series is the multiplicative one, so I am going to apply a multiplicative decomposition. Below I show the graphical representation of the observed values, the trend (which includes the cycle), the seasonal component and the errors.
The values of each component for the last 5 years and the seasonally adjusted data are shown below. For the trend and error components, no values are available for the first six and last six months, because the trend was computed with a 12-term centered moving average.
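A minimal sketch of this classical decomposition; decompose() uses the centered 12-term moving average mentioned above, and seasadj() is from the forecast package:

```r
library(forecast)

# Classical multiplicative decomposition of the training series
dec <- decompose(co2_train, type = "multiplicative")
plot(dec)

# Seasonally adjusted data and the last 5 years (60 months) of each component
co2_sa <- seasadj(dec)
tail(cbind(trend = dec$trend, seasonal = dec$seasonal, random = dec$random), 60)
```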
Seasonal and Trend decomposition using Loess - STL method
To compute this method in R, I used the "stl" function on the log of the data, to mimic the multiplicative model, with outer set to 20 and inner set to 2. Below, the plot of all the components and the data can be seen.
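A sketch of that call; s.window = "periodic" is an assumption on my part, since the text only specifies the inner and outer loop counts:

```r
# STL decomposition of the logged training data
lco2    <- log(co2_train)
fit_stl <- stl(lco2, s.window = "periodic", inner = 2, outer = 20)
plot(fit_stl)
```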
After exponentiating the results, since the data were logged, I can obtain the values of the three components of the decomposition for the last 12 months.
With this information, I can compute the forecast for the data and then check the accuracy of the predictions. Below are the forecast values, obtained by exponentiating the forecasts of the logged series.
To compute the accuracy, I compared these values against the original time series and got good results, with a MAPE of 3.71%.
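A sketch of the forecast and accuracy computation from the STL fit; the forecasting method passed to forecast() (here "naive") is an assumption:

```r
library(forecast)

# 12-month forecast of the held-out year, computed on the log scale
fc_stl <- forecast(fit_stl, method = "naive", h = 12)

# Back-transform with exp() and compare against the observed values
accuracy(exp(fc_stl$mean), co2_test)
```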
Because the time series has seasonality and trend, I had to consider seasonal AR and/or MA components besides the regular AR and/or MA components. I divided the dataset into a training set and a test set containing only the last 12 months of observations, and because the variance seems to change over time, I will work with the log of the data, which is more stable.
To find the best SARIMA model, I will study the data and the ACF and PACF plots of the logged data, the differenced data, the annually differenced data and the annual difference of the differenced data.
Analyzing the plots and the output of the functions ndiffs() and nsdiffs(), I could see that d can be 0 or 1 and that D = 1.
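A sketch of those checks, with the logged training data named lcde as in the text; ndiffs() and nsdiffs() are from the forecast package and acf2() is from astsa:

```r
library(forecast)
library(astsa)

lcde <- log(co2_train)   # logged training data, as referred to in the text

ndiffs(lcde)    # suggested regular differencing order d
nsdiffs(lcde)   # suggested seasonal differencing order D

# ACF/PACF of the candidate differenced series
acf2(diff(lcde), max.lag = 48)                   # d = 1
acf2(diff(lcde, lag = 12), max.lag = 48)         # D = 1
acf2(diff(diff(lcde), lag = 12), max.lag = 48)   # d = 1 and D = 1
```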
Although the ACF and PACF plots indicate an AR component, it is not easy to clearly identify a single model for this time series.
Starting with a large p and a small P, the first model, which I called model1, is sarima(lcde,9,1,2,1,1,2,12). Although the residuals show no correlation, a few parameters of this model are not statistically significant, so I set them to zero and re-estimated the model.
After the re-estimation, I had to recalculate the Ljung-Box statistic, and I could see that the residuals are now correlated at almost all lags.
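A sketch of model1, of the re-estimation with some coefficients fixed at zero (model2), and of the recomputed Ljung-Box statistic. Which of the 14 coefficients are zeroed here is an assumption made only for illustration; only the count (9) matches the table below.

```r
library(astsa)

# model1: SARIMA(9,1,2)x(1,1,2)[12] on the logged training data;
# sarima() also plots the residual diagnostics, including Ljung-Box p-values
model1 <- sarima(lcde, 9, 1, 2, 1, 1, 2, 12)

# model2: same orders, re-estimated with the non-significant coefficients fixed
# at 0 via stats::arima(). The positions of the zeros below are illustrative.
fixed <- c(NA, NA, rep(0, 7),   # ar1-ar2 free, ar3-ar9 fixed at 0
           NA, 0,               # ma1 free, ma2 fixed at 0
           NA,                  # sar1 free
           NA, 0)               # sma1 free, sma2 fixed at 0
model2 <- arima(lcde, order = c(9, 1, 2),
                seasonal = list(order = c(1, 1, 2), period = 12),
                fixed = fixed, transform.pars = FALSE)

# Recompute the Ljung-Box statistic on the residuals of the re-estimated model
Box.test(residuals(model2), lag = 24, type = "Ljung-Box", fitdf = sum(is.na(fixed)))
```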
After analyzing model1 and model2, I can conclude that model1 explains the serial correlation very well, although some of its parameters are not statistically significant. Model2, however, is not a good option for explaining the serial correlation.
Comparing the information criteria of the two, we can see that model1 has the lower values and uncorrelated residuals, so I will keep model1 to produce the forecasts.
Model | Order | NPar | AIC | AICc | BIC |
---|---|---|---|---|---|
model1 | (9,1,2)x(1,1,2)[12] | 14 | -2379.35 | -2378.44 | -2314.98 |
model2 | (9,1,2)x(1,1,2)[12] | 14 (9 fixed at 0) | -2165.8 | -2165.64 | -2140.05 |
I tried other values for the orders of the seasonal components of the model, but they did not produce better results. However, by increasing the order of the MA component, lowering the order of the AR component and removing the regular differencing, I was able to get a good model that I hope can produce good forecasts.
Using model3, defined as sarima(lcde,1,0,7,2,1,1,12), I obtained good results, with almost all the residuals uncorrelated but with some parameters not statistically significant, so, as before, I set them to zero and re-estimated the model, obtaining model4.
After the re-estimation, I recalculated the Ljung-Box statistic and, this time, I could see that the residuals are almost all uncorrelated and all the parameters are statistically significant.
After analyzing these two models, I could see that model4 has a better AIC, so I will keep it to produce forecasts.
Model | Order | NPar | AIC | AICc | BIC |
---|---|---|---|---|---|
model3 | (1,0,7)x(2,1,1)[12] | 11 | -2395.76 | -2395.16 | -2344.23 |
model4 | (1,0,7)x(2,1,1)[12] | 11 (5 fixed at 0) | -2397.25 | -2397.04 | -2367.2 |
After trying to further tune the parameters of model3, I found that this configuration was the best one. Therefore, I will use model1, model3 and model4 to produce forecasts for this time series, with model4 being the one with the lowest information criteria.
The forecast accuracy results for the training set and the test set can then be obtained, as shown below.
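A sketch of how these measures can be computed, assuming model3 and model4 are stats::arima fits of the logged training data (like model2 in the earlier sketch), and that model1 is the sarima() output whose underlying fit is in $fit. The comparison is done on the log scale, which is an assumption consistent with the magnitudes in the tables; exp() would map the forecasts back to million metric tons.

```r
library(forecast)

ltest <- log(co2_test)

# Test-set accuracy: 12-step-ahead forecasts from each fit against the held-out year
accuracy(predict(model1$fit, n.ahead = 12)$pred, ltest)   # model1
accuracy(predict(model3,     n.ahead = 12)$pred, ltest)   # model3
accuracy(predict(model4,     n.ahead = 12)$pred, ltest)   # model4

# Training-set accuracy: one-step in-sample fitted values (actual minus residuals)
accuracy(lcde - residuals(model4), lcde)
```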
For the training set, I got the following accuracy measures:
Model | RMSE | MAE | MAPE | MASE |
---|---|---|---|---|
model1 | 0.0254 | 0.0193 | 0.3174 | 0.5869 |
model3 | 0.0253 | 0.0195 | 0.3217 | 0.5945 |
model4 | 0.0255 | 0.0197 | 0.3254 | 0.6013 |
All the models have similar performance on the training set, with model4 slightly worse than the others on all measures. This is expected, because model4 has the fewest free parameters and therefore fits the training data slightly less closely, while the information criteria still favour it, since they penalize models with more parameters.
For the test set, the measures are shown below:
Model | RMSE | MAE | MAPE | MASE |
---|---|---|---|---|
model1 | 0.0455 | 0.0382 | 0.6311 | 1.1623 |
model3 | 0.0440 | 0.0374 | 0.6188 | 1.1399 |
model4 | 0.0437 | 0.0371 | 0.6140 | 1.1311 |
However, on the test set it is model4, which was the worst of the three models on the training set, that has the best performance, with a MAPE of 0.614%.
Below I show the 95% prediction intervals of the forecasts for the best-performing model on the test set, which is model4.
Analyzing the plots of the forecasts, with their intervals, against the observed values, I noticed that all the models performed similarly: the forecasts overestimate the observed values in the last months, but in the other months they are very close to the observations.
For better visualization, I only show the plots for the period after 2019.
In the graph below, we can see that the forecast was very good up to August, after which the forecasting errors increased considerably.
Finally, the multistep-ahead forecasts for the three models can be seen below.
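A sketch of such multistep-ahead forecasts, with approximate 95% limits, using sarima.for() from astsa; the zeroed coefficients of model4 are not reproduced here:

```r
library(astsa)

# 12-month multistep-ahead forecasts on the log scale; sarima.for() plots the
# forecasts together with prediction-error bounds
f1 <- sarima.for(lcde, n.ahead = 12, 9, 1, 2, 1, 1, 2, 12)   # model1 orders
f3 <- sarima.for(lcde, n.ahead = 12, 1, 0, 7, 2, 1, 1, 12)   # model3 orders

# Approximate 95% interval for model3, back-transformed to the original scale
exp(cbind(lower = f3$pred - 1.96 * f3$se,
          point = f3$pred,
          upper = f3$pred + 1.96 * f3$se))
```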
After this analysis, I can conclude that the SARIMA models produced very accurate forecasts, that the most accurate forecasts can be obtained from several different models, and that finding the single best one is difficult.
As I stated before, model4 is the best of all the models I used, despite being the one with the fewest free parameters, but the other models can make good predictions too.
For future work, I would like to apply multivariate time series analysis to the data I used, broken down into its individual emission series.