forked from paezha/covid19-environmental-correlates
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
1493 lines (1300 loc) · 111 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output: github_document
always_allow_html: true
bibliography: References.bib
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# A spatio-temporal analysis of the environmental correlates of COVID-19 incidence in Spain
Antonio Paez (McMaster University)
Fernando A. Lopez (Universidad Politecnica de Cartagena)
Tatiane Menezes (Universidade Federal de Pernambuco)
Renata Cavalcanti (Universidade Federal de Pernambuco)
Maira Galdino da Rocha Pitta (Universidade Federal de Pernambuco)
**Geographical Analysis (Early View) https://doi.org/10.1111/gean.12241**
**Pre-print available [here](https://github.com/paezha/covid19-environmental-correlates/blob/master/Environmental-Correlates-of-COVID19-Spain/Environmental-Correlates-of-COVID19-Spain.pdf)**
## Abstract
The novel SARS-CoV2 has disrupted health systems and the economy, and public health interventions to slow its spread have been costly. How and when to ease restrictions to movement hinges in part on whether SARS-CoV2 will display seasonality due to variations in temperature, humidity, and hours of sunshine. Here, we address this question by means of a spatio-temporal analysis in Spain of the incidence of COVID-19, the disease caused by the virus. Use of spatial Seemingly Unrelated Regressions (SUR) allows us to model the incidence of reported cases of the disease per 100,000 population as an interregional contagion process, in addition to a function of temperature, humidity, and sunshine. In the analysis we also control for GDP per capita, percentage of older adults in the population, population density, and presence of mass transit systems. The results support the hypothesis that incidence of the disease is lower at higher temperatures and higher levels of humidity. Sunshine, in contrast, displays a positive association with incidence of the disease. Our control variables also yield interesting insights. Higher incidence is associated with higher GDP per capita and presence of mass transit systems in the province; in contrast, population density and percentage of older adults display negative associations with incidence of COVID-19.
## Keywords
SARS-CoV2
COVID-19
Seasonality
Temperature
Humidity
Population density
Older adults
Spatial SUR
Contagion
Spain
```{r load-packages, include=FALSE}
# Download and install these two packages:
# covid19env
# spsur
library(covid19env)
library(ggthemes)
library(gridExtra)
library(kableExtra)
library(lubridate)
library(plm)
library(sf)
library(spdep)
library(spsur)
library(tidyverse)
library(units)
```
```{r load-data, include=FALSE, cache=TRUE}
# Load data from package `covid19env`
data("covid19_spain_1") # Use data sourced from one representative station for the province
#data("covid19_spain_2") # Use data with the average of all stations in provinces
data("provinces_spain")
```
```{r data-preparation, include=FALSE, cache=TRUE}
# Convert GDP per capita to thousands of euros:
provinces_spain <- provinces_spain %>%
mutate(GDPpc = GDPpc/1000)
# Join provincial data to incidence data and convert to simple features:
covid19_spain <- covid19_spain_1 %>%
#covid19_spain <- covid19_spain_2 %>% # data with the average of all stations in provinces
left_join(provinces_spain %>% st_drop_geometry(),
by = c("province", "CCAA", "ID_INE"))
```
```{r sort-communities-by-latitude, include=FALSE, cache=TRUE}
# To visualize the distribution of temperature by CCAA, we sort the communities by latitude, from north to south. We do this by community instead of province because the large number of provinces clutters the plots.
# Dissolve internal provincial boundaries to obtain a simple features object with the autonomous communities:
ccaa.sf <- provinces_spain %>%
group_by(CCAA) %>%
summarize(provinces = n())
# Extract coordinates of autonomous communities
ccaa.coords <- ccaa.sf %>%
st_centroid() %>%
st_coordinates() %>%
as.data.frame()
# Join Y coordinate to ccaa.sf
ccaa.sf <- ccaa.sf %>%
mutate(long = ccaa.coords$Y)
# Sort autonomous communities from north to south
ccaa.levels <- ccaa.sf %>%
arrange(desc(long)) %>% select(CCAA)
ccaa.levels <- as.character(ccaa.levels$CCAA)
# Relevel autonomous communities
covid19_spain <- covid19_spain %>%
mutate(CCAA = factor(CCAA, levels = ccaa.levels, ordered = TRUE))
```
Introduction {#introduction}
==========================
From a small outbreak linked to a live animal market at the end of 2019 to a global pandemic in a matter of weeks, the SARS-CoV2 virus and COVID-19, the disease it causes, have threatened to overrun health systems around the world. In efforts to contain the spread of the disease, numerous governments in many regions and nations have either recommended or mandated social distancing measures, and as of this writing, millions of people in five continents shelter in place. There are encouraging signs that these measures have mitigated the spread of the virus [e.g., @Lancastle2020impact; @Lewnard2020scientific; @Wilder2020isolation]. Even so, this has come at a high cost, and the consequences for all spheres of economic, social, and cultural life have been dire [e.g., @Fernandes2020economic; @Luo2020how]. As a result, there is a sense of urgency to anticipate the progression of the pandemic, in order to plan for progressive lifting of restrictions to movement and social contact [e.g., @Kissler2020projecting]. Needless to say, this has become a delicate, and politically charged, balancing act between public health and the economy [@Gong2020balance].
A salient question in the debate about how and when to ease social distancing measures is whether the virus will display seasonal variations. Existing research on similar pathogens suggests that the virus could be more stable and potentially easier to transmit in conditions of low temperature and low humidity. While this is encouraging, it is important to keep in mind that "not all seasonal respiratory viruses experience the same spatiotemporal patterns" [@deangel2020weathering, section 4]. This urges caution when extrapolating from known viruses. The evidence in this respect is as yet inconclusive, and although easing restrictions as the weather warms may appear tempting, doing so prematurely could well undo weeks or possibly months of costly measures.
It is not surprising, given the stakes involved, that this issue has already triggered a lively debate. The current state of knowledge was well-summarized by the National Academy of Sciences, Engineering, and Medicine in the U.S. in a recent report [see @National2020rapid]. Engaged by the Office of the Executive for guidance on this matter, this organization concluded that: "[some] limited data support a potential waning of cases in warmer and more humid seasons, yet none are without major limitations...Additional studies as the SARS-CoV2 pandemic unfolds could shed more light on the effects of climate on transmission" (p. 6). To further complicate matters, much of the relevant work has yet to be peer-reviewed [see for instance the challenge of @Harbert2020spatial; to @Araujo2020spread].
With the above considerations in mind, our objective with this paper is to investigate the influence of environmental factors, concretely temperature, humidity, and sunshine, on the progression of the pandemic. We adopt a population health approach, and report results from a spatio-temporal model of the incidence of COVID-19 in the coterminous provinces in Spain, one of the countries hardest hit by the pandemic. We combine data on reported cases of the disease with metereological information, to create a spatio-temporal dataset covering a period of 30 days. We then join this dataset with provincial-level economic and demographic information to act as controls to our key environmental variables. These data are analyzed using a spatial Seemingly Unrelated Regressions (SUR) approach, which allows us to model incidence of COVID-19 as a contagion process.
The results provide evidence of the effect of temperature, humidity, and sunshine on the incidence of COVID-19. The clearest result with respect to these variables is a lower incidence of COVID-19 at higher temperatures and levels of humidity, while the opposite happens with respect to hours of sunshine. Our control variables also provide some intriguing insights. Higher incidence is associated with higher GDP per capita and presence of mass transit systems in the province; in contrast, population density and percentage of older adults display negative associations with incidence of COVID-19. The results of this analysis provide support to the hypothesis of seasonality of the novel SARS-CoV2, and should be of interest to public health officials and policy makers grappling with the question of the trajectory of the pandemic.
Please note that this paper is prepared as a reproducible research document. The source `R` markdown document, as well as all data and code needed to reproduce/review/extend the analysis can be obtained from a public repository\footnote{\url{https://drive.google.com/drive/u/1/folders/1d_40N_QXo2Fl14r3T3CHs84IED-B1rV_}}.
Background {#background}
============
The global emergence of infectious diseases is mostly driven by environmental, ecological, and socio-economic factors [@Jones2008global]. In the case of SARS-CoV2, the ecological factors include the interaction between humans and wildlife. Once transmission of a disease begins to happen between humans, socio-economic and environmental factors become increasingly important. As noted in the introduction, the focus of the paper is on environmental variables, concretely three related to meteorological conditions: temperature, humidity, and sunshine.
Much of what is known about the potential seasonality of SARS-CoV2 is a result of research on other pathogens. Earlier, diverse studies have shown the effect of temperature and humidity on the incidence of influenza [e.g., @Makinen2009cold; @Jaakkola2014decline; @Kudo2019low]. Jaakkola et al. [-@Jaakkola2014decline], for example, found that a decrease of temperature and absolute humidity precedes the onset of symptoms of influenza A and B viruses by 3 days in places where the temperature is low. After the 2002-2004 outbreak of SARS, researchers also began to investigate the relationship between these factors and SARS-CoV [@Casanova2010effects; @Chan2011effects]. Casanova et al. [-@Casanova2010effects], for instance, used surrogates to find that virus inactivation was likely more rapid at higher temperatures; in terms of humidity, these researchers reported that survival of the virus was lower at moderate relative humidity levels. Chan et al. [-@Chan2011effects] also found that viability of the virus that causes SARS is also lost at higher temperatures (>38°C) and relative humidity superior to 95%.
Whether results from laboratory experiments will hold when the virus circulates in the community remains uncertain. At a global scale, de Ángel Solá et al. [-@deangel2020weathering] see less risk from SARS-CoV2 in the Caribean Basin; on the other hand, Coelho et al. [@Coelho2020exponential] warn that at least during the exponential phase, expansion of the virus is not driven by climate. Similarly, whereas Araujo and Naimi [-@Araujo2020spread] argue that spread of SARS-CoV2 will likely be constrained by climate, Harbert et al. [-@Harbert2020spatial] remain unconvinced that spatial modelling can currently discriminate the distribution of the disease on the basis of climate, at least in the United States. Yao et al. [-@Yao2020association], examined data from China and came to the conclusion that neither temperature nor ultraviolet indices had an association with transmission of COVID-19. This is despite previous research that has linked less exposure to UVB radiation to higher prevalence and severity of acute respiratory tract infections [@Zittermann2016vitamin; @Dąbrowska2018vitamin; @Dinlen2016association; @Mathyssen2017vitamin; @Esposito2015vitamin; @Jat2017vitamin; @Moriyama2020seasonality].
In addition to the environmental variables above, from a population health perspective it is also important to account for potential socio-economic and demographic confounders.
To account for population-level factors, the first variable that we consider is GDP per capita. Much has been written about globalization and the spread of infectious disease\footnote{As the Globe and Mail, Canada's paper of record, put it in relation to the SARS outbreak in 2003: "Globalization means that if someone in China sneezes, someone in Toronto may one day catch a cold" (Editorial, March 28, 2003, p. A18)}. The growth in global connections has presented a challenge to spatial approaches in the initial stages of disease management, when the cause of a disease may still be unclear but the plane has already departed [@Zhou2016accelerated]. In reference to the earlier outbreak of SARS, van Wagner [-@vanWagner2008toward] chronicles how Toronto's status as a global city turned out to be a vulnerability in this respect. In our case, we think of GDP per capita as a marker of a region's relative position in a network of global cities, and its potential to be further ahead in the trajectory of the pandemic. Furthermore, wealthier regions also tend to concentrate more activities that produce non-traded goods, including building and construction [@Hallet2002regional]. Therefore, it is possible that wealthier regions remain relatively more active even during a lockdown. On the other hand, we cannot discount the possibility that less wealthy regions have a higher proportion of workers in manual occupations who cannot telework, and therefore have more difficulties complying with shelter-in-place orders.
Secondly, we consider percentage of older adults (over 65) in a region. Early evidence regarding COVID-19 suggests that the case rate mortality is higher at older ages [e.g. @Novel2020epidemiological]. However, it is not clear that a relatively large population of older adults necessarily translates into higher transmission rates of the infection. The tool of choice in containing the spread of the disease has been social distancing. In this respect, the evidence from the field of transportation is that older adults tend to travel less frequently, for shorter distances, and have higher rates of immobility than most everyone, except the youngest members of the public [e.g., @Roorda2010trip; @Morency2011distance; @Sikder2012immobility]. In other words, many older adults are, whether by preference or otherwise, already in a form of social isolation. Social distancing during the pandemic may actually reinforce that condition for them, as suggested by the analysis of age-structured social contact in India, China, and Italy of Sing and Adhikari [-@Singh2020age]. Since the age-structured matrix of social contact in Spain is similar to Italy [see @Prem2017projecting], our expectation is that populations with higher percentages of older adults will tend to have lower levels of social contact and hence of incidence.
Population density is also relevant since it directly affects the contact patterns and contact rates between individuals in a population [@Hu2013scaling]. The evidence available suggests a positive relationship between the transmission of COVID-19 and population density (e.g. cumulative incidence in urban areas like NYC). For this reason, we anticipate a positive relationship between population density and the incidence of the disease.
The last variable that we consider as a control is the presence of mass transit systems in a province. Every province in Spain offers some form of public transportation, but only five provinces have higher order systems of mass mobility (e.g. metro or subway), namely Barcelona, Madrid, Sevilla, Valencia, and Bizkaia. Public transportation has been hypothesized to relate to the spread of contagious disease by some researchers using agent-based approaches and simulation [e.g., @Perez2009agent; @Wang2010gis], and while we find scant evidence of a link in the literature, the idea is intuitively appealing. After all, unlike the isolation that a car offers to travellers, most mass transit system are cauldrons of social contact.
Context and Data {#context-and-data}
===================
## Covid-19 in Spain {#covid-19-in-spain}
The first reported case of COVID-19 in Spain was on January 31st, 2020, when a German tourist in the Canary Islands tested positive for the virus. After this case, it was still a few weeks before the first domestic case was reported, on February 27th in Sevilla province (Andalusia). In a short period of time, as testing started to ramp up, it became clear that an outbreak was flaring. By March 11th the World Health Organization (WHO) declared COVID-19 officially a pandemic. This declaration marked a turning point for the public in Spain too. As of March 13th, the number of cases of COVID-19 reported in Spain was 4,473, with a majority of cases (1,990) concentrated in Madrid: these numbers were at the time the worst outbreak in Europe after Italy. In response to the situation, on March 13th the Spanish National Government declared a state of emergency, to go into effect on Saturday March 14th. As part of the state of emergency restrictions to most activities were imposed, with the exception of essential services (e.g. food, health) and some economic subsectors of industry and construction. A few days later, on March 17th, Spain closed its lands borders to allow entry only to returnee nationals and permanent residents. The lockdown was further hardened between March 30th and April 12th (including the Easter weekend of April 10th-12th) and during this period only essential activities were allowed. During this period, there was a dramatic reduction in overall mobility, both within provinces as between \footnote{https://www.mitma.gob.es/ministerio/covid-19/evolucion-movilidad-big-data/movilidad-provincial}.
## Data {#data}
Our dataset includes information about the daily number of cases of COVID-19 reported in Spain at the provincial level (NUTS3 in Eurostat terminology) for the period between March 13th and April 11th, inclusive. For our purposes, we consider positive cases reported, but exclude symptomatic cases diagnosed by a doctor without a Polymerase Chain Reaction (PCR) test. The Spanish National Government publishes periodic updates at the regional level (NUTS2) and the information is also released at the provincial level as part of a collaborative project by geovoluntarios.com\footnote{https://www.geovoluntarios.org/}, ProvidencialData19\footnote{https://www.datoscovid.es/pages/providencialdata19}, and ESRI España. This information is compiled from several sources, mainly the official web pages of the Spanish regional goverments, as documented in Centro de Datos Covid-19\footnote{https://www.datoscovid.es/pages/sobre-la-iniciativa}. We consider two sets of explanatory variables. The first one, and the focus of this research, are the three environmental variables, collected from official sources (i.g., *AEMET*, the state meteorology agency, and *MAPA*, the ministry of agriculture, fisheries, and food). The second set provides some relevant controls as discussed above, and are also collected from official sources, i.e., INE, the national statistics institute. Table \ref{tab:descriptive-statistics} shows the descriptive statistics and the provenance of the data used in this research.
The spatial and temporal coverage of the data is as follows. Our dataset begins on March 13, which is the first date when every province had reported at least one case of COVID-19, and continues until April 11, for a period of 30 days. The unit of analysis is the province. Provinces are the equivalent of states, and are embedded in Autonomous Communities. As an example, Cataluña is an Autonomous Community and consists of four provinces, namely Barcelona, Gerona, Lerida, and Tarragona. The size of the provinces is relatively large, as seen in Table \ref{tab:descriptive-statistics}. The smallest province is $1,978.12km^2$ (this is smaller than Rhode Island in the U.S.) and the largest province is $21,767.93km^2$ (slightly smaller than New Jersey in the U.S.). While this is a relatively large degree of spatial aggregation, reporting on COVID-19 is not yet consistent at smaller geographies, or cases are not reported at that level at all.
```{r descriptive-statistics, echo=FALSE, cache=TRUE}
data.frame(Variable = c("COVID-19 Incidence",
"Area",
"GDPpc",
"Older",
"Population Density",
"Mean Temperature",
"Humidity",
"Sunshine"),
Note = c("Incidence in reported cases of SARS-19 per 100,000 people",
"Area of province in sq.km",
"GDP per capita in €1,000s",
"Percentage of people aged 65 and older in the province",
"Population density in the province in people per sq.km",
"Mean temperature in province by date, in °C",
"Relative humidity in province by date",
"Daily hours of sunshine in province by date"),
Min = c(min(covid19_spain$Incidence),
min(provinces_spain %>%
st_area() %>%
set_units(km^2)),
min(provinces_spain$GDPpc),
min(provinces_spain$Older),
min(provinces_spain$Density),
min(covid19_spain$Mean_Temp),
min(covid19_spain$Humidity),
min(covid19_spain$Sunshine_Hours)),
Mean = c(mean(covid19_spain$Incidence),
mean(provinces_spain %>%
st_area() %>%
set_units(km^2)),
mean(provinces_spain$GDPpc),
mean(provinces_spain$Older),
mean(provinces_spain$Density),
mean(covid19_spain$Mean_Temp),
mean(covid19_spain$Humidity),
mean(covid19_spain$Sunshine_Hours)),
Max = c(max(covid19_spain$Incidence),
max(provinces_spain %>%
st_area() %>%
set_units(km^2)),
max(provinces_spain$GDPpc),
max(provinces_spain$Older),
max(provinces_spain$Density),
max(covid19_spain$Mean_Temp),
max(covid19_spain$Humidity),
max(covid19_spain$Sunshine_Hours)),
SD = c(sd(covid19_spain$Incidence),
sd(provinces_spain$GDPpc),
sd(provinces_spain %>%
st_area() %>%
set_units(km^2)),
sd(provinces_spain$Older),
sd(provinces_spain$Density),
sd(covid19_spain$Mean_Temp),
sd(covid19_spain$Humidity),
sd(covid19_spain$Sunshine_Hours)),
Source = c("ProvidencialData19",
"INE",
"INE",
"INE",
"INE",
"AEMET",
"MAPA",
"AEMET")) %>%
kable(#"latex",
"html",
booktabs = TRUE,
digits = 2,
caption = "\\label{tab:descriptive-statistics} Descriptive statistics") %>%
#kable_styling(latex_options = c("striped", "scale_down")) %>%
kable_styling(bootstrap_options = c("striped", "condensed")) %>%
column_spec(2, width = "15em") %>%
column_spec(7, width = "5em") %>%
footnote(general = c("ProvidencialData19: https://www.datoscovid.es/pages/providencialdata19",
"INE (Instituto Nacional de Estadistica): https://www.ine.es/",
"AEMET (Agencia Estatal de Meteorologia): http://eportal.mapa.gob.es",
"MAPA (Ministerio de Agricultura, Pesca y Alimentacion): http://eportal.mapa.gob.es"))
```
An important aspect of working with environmental data such as temperature, humidity, and hours of sunshine, is the incubation period of the disease. Lauer et al. [-@Lauer2020incubation] report for the case of COVID-19 a median incubation period of 5.7 days (with a confidence interval between 4.9 to 7.8 days). The vast majority of cases (95%) develop symptoms between 2.6 days (CI, 2.1 to 3.7 days) and 12.5 days (CI, 8.2 to 17.7 days). For this reason, we judge it best to use lagged values of the environmental variables. We test different time lags as follows. We consider a lagged 8-day average, from date-minus-12 to date-minus-5 days (hereafter *lag8*). Secondly, we consider a lagged 11-day average, from date-minus-12 to date-minus-2 days (hereafter *lag11*). Finally, to account for the likely duration of incubation, we consider a lagged 11-day *weighted* average, from date-minus-12 to date-minus-2 days (hereafter *lag11w*). The weights for this variable are calculated using the parameters of the log-normal distribution reported by Lauer et al. [-@Lauer2020incubation], i.e., a log-mean of 1.621 and a log-standard deviation of 0.418. With these weights, the environmental variables at date-minus-2 and date-minus-12 days are weighted as 0.041 and 0.009 respectively, whereas the environmental variables at date-minus-5 days are weighted as 0.194. These weights have the effect of changing the contribution of daily values to the lagged moving average. For instance, the temperature at date-minus-4-days is weighted more heavily than the temperature at date-minus-10-days, as a closer approximation of the conditions at the most likely time of contagion before the disease became manifest.
Methods: the Spatial SUR Model {#methods}
===================
The Seemingly Unrelated Regression equations model (SUR hereafter) is a multivariate econometric model used in different fields when the structure of the data consists of cross-sections for different time periods. The basis of this approach is well-known since the initial works of Zellner [-@Zellner1962efficient], and has become a popular methodology included in several econometrics textbook [e.g., @Greene2003econometric]. To our knowledge, Anselin [-@Anselin1988spatial] was the first author to discuss SUR from a spatial perspective, in the context of spatio-temporal analysis. In his landmark text, Anselin discussed a model made of "an equation for each time period, which is estimated for a cross section of spatial units" (p. 141). From this milestone, a large body of research has developed to extend the classical SUR into a spatial framework [e.g., @Rey1999us; @Lauridsen2010spatiotemporal; @Legallo2006evaluating; @Lopez2017spatial].
The classical SUR model without spatial effects (from here, SUR-SIM) is a stack of equations as follows:
<!--
\begin{equation}
\label{eq:sur-sim}
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_T
\end{bmatrix}
=
\begin{bmatrix}
X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_T
\end{bmatrix}
\
\begin{bmatrix}
\beta_1 \\ \beta_1 \\ \vdots \\ \beta_T
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T
\end{bmatrix}
\end{equation}
\noindent where $y_{t}=(y_{1t},...,y_{Nt})$ is a $N \times 1$ vector, and in our case $y_{st}$ is the incidence ratio in the province $s$ ($s=1,...,N$) the day $t$ $(t=1,...,T)$; $X_t$ is a $N \times k_t$ matrix of the $k_t$ independent variables, with associated vector of coefficients $\beta_t$,; $\beta_t=(\beta_{1t},...,\beta_{Nt})$ is a vector of coefficients and $\epsilon_t=(\epsilon_{1t},...,\epsilon_{Nt})$ is the vector of residuals.
-->
<!--Equation for README.md generated using this app https://alexanderrodin.com/github-latex-markdown/ -->
![\begin{equation} \label{eq:sur-sim} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_T \end{bmatrix} \ \begin{bmatrix} \beta_1 \\ \beta_1 \\ \vdots \\ \beta_T \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T \end{bmatrix} \end{equation}](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20%5Clabel%7Beq%3Asur-sim%7D%20%5Cbegin%7Bbmatrix%7D%20y_1%20%5C%5C%20y_2%20%5C%5C%20%5Cvdots%20%5C%5C%20y_T%20%5Cend%7Bbmatrix%7D%20%3D%20%5Cbegin%7Bbmatrix%7D%20X_1%20%26%200%20%26%20%5Ccdots%20%26%200%20%5C%5C%200%20%26%20X_2%20%26%20%5Ccdots%20%26%200%20%5C%5C%20%5Cvdots%20%26%20%5Cvdots%20%26%20%5Cddots%20%26%20%5Cvdots%20%5C%5C%200%20%26%200%20%26%20%5Ccdots%20%26%20X_T%20%5Cend%7Bbmatrix%7D%20%5C%20%5Cbegin%7Bbmatrix%7D%20%5Cbeta_1%20%5C%5C%20%5Cbeta_1%20%5C%5C%20%5Cvdots%20%5C%5C%20%5Cbeta_T%20%5Cend%7Bbmatrix%7D%20%2B%20%5Cbegin%7Bbmatrix%7D%20%5Cepsilon_1%20%5C%5C%20%5Cepsilon_2%20%5C%5C%20%5Cvdots%20%5C%5C%20%5Cepsilon_T%20%5Cend%7Bbmatrix%7D%20%5Cend%7Bequation%7D)
\noindent where ![$y_{t}=(y_{1t},...,y_{Nt})$](https://render.githubusercontent.com/render/math?math=%24y_%7Bt%7D%3D(y_%7B1t%7D%2C...%2Cy_%7BNt%7D)%24) is an ![$N \times 1$](https://render.githubusercontent.com/render/math?math=%24N%20%5Ctimes%201%24) vector, and in our case ![$y_{st}$](https://render.githubusercontent.com/render/math?math=%24y_%7Bst%7D%24) is the incidence ratio in the province ![$s\text{ }(s=1,...,N)$](https://render.githubusercontent.com/render/math?math=%24s%5Ctext%7B%20%7D(s%3D1%2C...%2CN)%24) the day ![$t\text{ }(t=1,...,T)$](https://render.githubusercontent.com/render/math?math=%24t%5Ctext%7B%20%7D(t%3D1%2C...%2CT)%24); ![$X_t=(X^1,...,X^{k_t})$](https://render.githubusercontent.com/render/math?math=%24X_t%3D(X%5E1%2C...%2CX%5E%7Bk_t%7D)%24) is a ![$N \times k_t$](https://render.githubusercontent.com/render/math?math=%24N%20%5Ctimes%20k_t%24) matrix of the ![$k_t$](https://render.githubusercontent.com/render/math?math=%24k_t%24) independent variables, ![$X_i=(X^i_{st})$](https://render.githubusercontent.com/render/math?math=%24X_i%3D(X%5Ei_%7Bst%7D)%24); ![$\beta_t=(\beta_{1t},...,\beta_{Nt})$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta_t%3D(%5Cbeta_%7B1t%7D%2C...%2C%5Cbeta_%7BNt%7D)%24) is a vector of coefficients and ![$\epsilon_t=(\epsilon_{1t},...,\epsilon_{Nt})$](https://render.githubusercontent.com/render/math?math=%24%5Cepsilon_t%3D(%5Cepsilon_%7B1t%7D%2C...%2C%5Cepsilon_%7BNt%7D)%24) is the vector of residuals.
A key feature of the SUR model is the temporal dependence structure among the vectors of residuals, namely:
<!--
\begin{equation}
\label{eq:sur-err}
E[\epsilon_t \epsilon'_{t'}]=\sigma_{tt'}
\end{equation}
Note that this specification is very flexible, in that it allows changes in the coefficients $\beta_{it}$ in order to modulate the effect of the independent variables on $y_t$. This flexibility can be reduced and it is posible to impose restrictions considering some $\beta$ coefficients as being constant over time. In this case, we can reformulate the coefficients expression $\beta_t = (\beta_{1}, \cdots, \beta_{r-1}, \beta_{r}, \beta_{r+1}, \cdots, \beta_{Nt})$ to restrict the first $r$ coefficients to be constant across equations. This is equivalent to specifying some effects to be invariant over time.
-->
![\begin{equation} \label{eq:sur-err} E\[\epsilon_t \epsilon'_{t'}\]=\sigma_{tt'} \end{equation} ](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20%5Clabel%7Beq%3Asur-err%7D%20E%5B%5Cepsilon_t%20%5Cepsilon'_%7Bt'%7D%5D%3D%5Csigma_%7Btt'%7D%20%5Cend%7Bequation%7D%20)
Note that this specification is very flexible, in that it allows changes in the coefficients ![$\beta_{it}$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta_%7Bit%7D%24) in order to modulate the effect of the independent variables on ![$y_t$](https://render.githubusercontent.com/render/math?math=%24y_t%24). This flexibility can be reduced and it is posible to impose restrictions considering some ![$\beta$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta%24) coefficients as being constant over time. In this case, we can reformulate the coefficients expression ![$\beta_t = (\beta_{1}, \cdots, \beta_{r-1}, \beta_{r}, \beta_{r+1}, \cdots, \beta_{Nt})$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta_t%20%3D%20(%5Cbeta_%7B1%7D%2C%20%5Ccdots%2C%20%5Cbeta_%7Br-1%7D%2C%20%5Cbeta_%7Br%7D%2C%20%5Cbeta_%7Br%2B1%7D%2C%20%5Ccdots%2C%20%5Cbeta_%7BNt%7D)%24) to restrict the first ![$r$](https://render.githubusercontent.com/render/math?math=%24r%24) coefficients to be constant across equations. This is equivalent to specifying some effects to be invariant over time.
Equation (\ref{eq:sur-sim}) can be rewriten in compact form:
<!--
\begin{equation}
\bf{Y} = \bf{X} \beta + \epsilon
\label{eq:sur-sim-block}
\end{equation}
\noindent where $\bf{Y}$ is now a vector of dimension $NT \times 1$, $\bf{X}$ is a block-diagonal matrix $NT \times K$
(with $K = \sum_t{k_t}$) and $\epsilon$ is an $NT \times 1$ vector. Using the Kronecker product notation ($\otimes$), the error matrix structure is expressed concisely as:
\begin{equation}
E[\epsilon \epsilon']=\bf{\Sigma} \otimes \bf{I_N} ; \ \bf{\Sigma}=(\sigma_{tt'})
\end{equation}
-->
![\begin{equation} y = X \beta + \epsilon ![E\[\epsilon \epsilon'\]=\bf{\Sigma} \otimes \bf{I_N} ; \ \bf{\Sigma}=(\sigma_{tt'})](https://render.githubusercontent.com/render/math?math=E%5B%5Cepsilon%20%5Cepsilon'%5D%3D%5Cbf%7B%5CSigma%7D%20%5Cotimes%20%5Cbf%7BI_N%7D%20%3B%20%5C%20%5Cbf%7B%5CSigma%7D%3D(%5Csigma_%7Btt'%7D))
\noindent where ![$\bf{Y}$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7BY%7D%24) is now a vector of dimension ![$NT \times 1$](https://render.githubusercontent.com/render/math?math=%24NT%20%5Ctimes%201%24), ![$\bf{X}$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7BX%7D%24) is a block-diagonal matrix ![$NT \times K$](https://render.githubusercontent.com/render/math?math=%24NT%20%5Ctimes%20K%24)
(with ![$K = \sum_t{k_t}$](https://render.githubusercontent.com/render/math?math=%24K%20%3D%20%5Csum_t%7Bk_t%7D%24)) and ![$\epsilon$](https://render.githubusercontent.com/render/math?math=%24%5Cepsilon%24) is an ![$NT \times 1$](https://render.githubusercontent.com/render/math?math=%24NT%20%5Ctimes%201%24) vector. Using the Kronecker product notation (![$\otimes$](https://render.githubusercontent.com/render/math?math=%24%5Cotimes%24)), the error matrix structure is expressed concisely as:
![\begin{equation} E\[\epsilon \epsilon'\]=\Sigma \otimes I_N ; \ \Sigma=(\sigma_{tt'}) \end{equation}](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20E%5B%5Cepsilon%20%5Cepsilon'%5D%3D%5CSigma%20%5Cotimes%20I_N%20%3B%20%5C%20%5CSigma%3D(%5Csigma_%7Btt'%7D)%20%5Cend%7Bequation%7D)
As is the case with cross-sectional data, it is possible to test the residuals of Model (\ref{eq:sur-sim-block}) for spatial autocorrelation, and several tests have been developed to test the null hypothesis of spatial independence [see @Lopez2014spatial]. When the null hypothesis is rejected, several alternative specifications have been proposed to include spatial effects [@Anselin1988spatial; see also @Anselin2016estimation]. In this paper we consider a spatial SUR model that incorporates a spatial lag of the dependent variable as an explanatory factor. Spatial analytical approaches were used to understand contagion-difussion processes in the case of infectious disease in general [e.g., @Cliff1998detecting] and the 2003-2004 SARS outbreak in particular [e.g., @Meng2005understanding; @Cao2010spatio]. While we are mindful of the same caveat that the novel SARS-CoV2 may not follow the patterns of previous diseases, there is still evidence from the United States that COVID-19 displays spatial patterns that are consistent with a diffusion process [@Desjardins2020rapid]. For this reason, the spatial lag model is appropriate to model incidence of COVID-19 geographically, since it accounts for potential spatial patterns that result from a process of contagion, as explained next.
The stack expresion for the SUR model with a spatially lagged dependent variable (SUR-SLM) is as follows:
<!--
\begin{equation}
\label{eq:sur-slm}
\begin{aligned}
\bf{AY} = \bf{X} \beta + \epsilon \\
\epsilon =N(0,\bf{\Sigma})
\end{aligned}
\end{equation}
\noindent where $\bf{A} =I_{TN}-\bf{\Gamma} \otimes W$ is the spatially lagged dependent variable, and $\bf{\Gamma} = diag(\rho_1, \cdots, \rho_T)$.
This specification assumes that outcome ($y_{st}$) at location $s$ and time $t$ is partially determined by the weighted average ($Wy_{st}$) of the outcome in neighboring provinces, with neighborhood defined by matrix $W$ of spatial weights. In other words, the spatially lagged dependent variable represents a process of contagion, where the disease in neighboring provinces can spillover in a spatial way. The coefficients of the spatially lagged variable are estimated for each time period $\rho_t$ and identify the intensity and the sign of the contagion effect. It is possible test the null hypothesis of identical levels of spatial dependence ($\rho_i=\rho_j, \forall i,j$). The correspond Wald test is available in the `R` package `spsur`.
-->
![\begin{equation} \label{eq:sur-slm} \begin{aligned} \bf{AY} = \bf{X} \beta + \epsilon \\ \epsilon =N(0,\bf{\Sigma}) \end{aligned} \end{equation}](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20%5Clabel%7Beq%3Asur-slm%7D%20%5Cbegin%7Baligned%7D%20%5Cbf%7BAY%7D%20%3D%20%5Cbf%7BX%7D%20%5Cbeta%20%2B%20%5Cepsilon%20%5C%5C%20%5Cepsilon%20%3DN(0%2C%5Cbf%7B%5CSigma%7D)%20%5Cend%7Baligned%7D%20%5Cend%7Bequation%7D)
\noindent where ![$\bf{A} =I_{TN}-\bf{\Gamma} \otimes W$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7BA%7D%20%3DI_%7BTN%7D-%5Cbf%7B%5CGamma%7D%20%5Cotimes%20W%24) is the spatially lagged dependent variable, and ![$\bf{\Gamma} = diag(\rho_1, \cdots, \rho_T)$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7B%5CGamma%7D%20%3D%20diag(%5Crho_1%2C%20%5Ccdots%2C%20%5Crho_T)%24).
This specification assumes that outcome (![$y_{st}$](https://render.githubusercontent.com/render/math?math=%24y_%7Bst%7D%24)) at location ![$s$](https://render.githubusercontent.com/render/math?math=%24s%24) and time ![$t$](https://render.githubusercontent.com/render/math?math=%24t%24) is partially determined by the weighted average (![$Wy_{st}$](https://render.githubusercontent.com/render/math?math=%24Wy_%7Bst%7D%24)) of the outcome in neighboring provinces, with neighborhood defined by matrix ![$W$](https://render.githubusercontent.com/render/math?math=%24W%24) of spatial weights. In other words, the spatially lagged dependent variable represents a process of contagion, where the disease in neighboring provinces can spillover in a spatial way. The coefficients of the spatially lagged variable are estimated for each time period ![$\rho_t$](https://render.githubusercontent.com/render/math?math=%24%5Crho_t%24) and identify the intensity and the sign of the contagion effect. It is possible test the null hypothesis of identical levels of spatial dependence (![$\rho_i=\rho_j, \forall i,j$](https://render.githubusercontent.com/render/math?math=%24%5Crho_i%3D%5Crho_j%2C%20%5Cforall%20i%2Cj%24)). The correspond Wald test is available in the `R` package `spsur`.
The SUR-SLM model can be estimated using maximum likelihood (@Lopez2014spatial) or instrumental variables (@Minguez2019).
A note regarding the interpretation of the model is in order. It is well-known that coefficients in linear regression models are partial derivatives of the dependent variable with respect to the independent variables, and therefore directly give the marginal effects or rates of change. Spatially lagged models, however, are no longer linear. The intuition behind the non-linearity is that the spatial lag expands the information set to include information from neighbouring regions: in other words, the value of an explanatory variable in a spatial unit can have influence in other spatial units via the spatial lag. This makes interpretation of the coefficients less straightforward but also richer [@Golgher2016interpret]. The results of LeSage [-@LeSage2009introduction] for cross-sectional spatial lag models can be extended to the spatial SUR framework. Note that, according to Elhorst [-@Elhorst2014spatial], the partial derivatives have the following interpretation: if an explanatory variable ($X_k$) in a particular province changes, not only the incidence rate in that province changes, also incidence rates in other provinces change via the contagion effect. Therefore, a change in $X_k$ in a particular province has a *direct effect* on that province, but also an *indirect effect* on neighbouring provinces. In this way, the $i$th diagonal element of the matrix of partial derivatives represents the direct effect on the $i$th province, whereas the non-diagonal elements of $i$th row of the matrix of partial derivatives represent the indirect effects on other provinces. In order to obtain a global indicator, the direct effect is calculated as the mean of the diagonal elements and captures the average change in incidence ratio caused by the change in $X_k$. Likewise, a global indicator of the indirect effects is calculated as the mean of the non-diagonal elements. The total effect is the sum of direct and indirect effects. Finally, note that if $\rho_k = 0$, the indirect effects are 0 and the direct effects are equal to $\beta_kt$.
Results {#results}
===================
## Exploratory Data Analysis {#eda}
Figure \ref{fig:weekly-average-incidence-map} shows the geographical variation in the incidence of COVID-19 in Spain, as well as the temporal progression of the disease in weekly averages. Our analysis begins on March 13. Albeit still low, the highest incidence at this early date was in the provice of Álava, in the North of Spain. Álava is not the most populous province, with a population of only 331,549, but it has the highest GDP per capita of all provinces. Vitoria, its main city, is the capital of the Basque Country and has been the focus of efforts, along with San Sebastian and Bilbao, to develop a "Global Basque City" [@Meijers2008strategic]. The other early focus of the pandemic in Spain was Madrid, which is the most populous province in the country and has the second highest GDP per capita after Álava. The early outbreaks in these two provinces can be traced throughout the progression of the pandemic over time, although by the end of the period under study, other provinces had matched and even surpased their incidence rates, including Segovia and Soria to the north of Madrid, and Ciudad Real and Albacete to the south.
```{r weekly-average-incidence-map, echo=FALSE, fig.height= 9, fig.cap="\\label{fig:weekly-average-incidence-map}Mean weekly incidence of COVID-19 by province, in reported cases by 100,000 people", warning=FALSE, cache = TRUE}
# Retrieve the coordinates of Madrid
madrid_xy <- st_centroid(provinces_spain %>% filter(province == "Madrid")) %>% st_coordinates()
week11.plot <- covid19_spain %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
group_by(province, week = isoweek(Date)) %>%
summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
filter(week == 11) %>%
left_join(provinces_spain,
by = c("ID_INE")) %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
scale_fill_distiller(name = "Mean Weekly Incidence",
palette = "Reds",
direction = 1,
limits = c(0.755, 988.71)) +
ggtitle("March 13 - March 15") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom",
plot.margin = margin(0, 0, 0, 0, "cm")) +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
week12.plot <- covid19_spain %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
group_by(province, week = isoweek(Date)) %>%
summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
filter(week == 12) %>%
left_join(provinces_spain,
by = c("ID_INE")) %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
scale_fill_distiller(name = "Mean Weekly Incidence",
palette = "Reds",
direction = 1,
limits = c(0.755, 988.71)) +
ggtitle("March 16 - March 22") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom",
plot.margin = margin(0, 0, 0, 0, "cm")) +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6) +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
week13.plot <- covid19_spain %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
group_by(province, week = isoweek(Date)) %>%
summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
filter(week == 13) %>%
left_join(provinces_spain,
by = c("ID_INE")) %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
scale_fill_distiller(name = "Mean Weekly Incidence",
palette = "Reds",
direction = 1,
limits = c(0.755, 988.71)) +
ggtitle("March 23 - March 29") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom",
plot.margin = margin(0, 0, 0, 0, "cm")) +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
week14.plot <- covid19_spain %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
group_by(province, week = isoweek(Date)) %>%
summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
filter(week == 14) %>%
left_join(provinces_spain,
by = c("ID_INE")) %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
scale_fill_distiller(name = "Mean Weekly Incidence",
palette = "Reds",
direction = 1,
limits = c(0.755, 988.71)) +
ggtitle("March 30 - April 5") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom",
plot.margin = margin(0, 0, 0, 0, "cm")) +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
week15.plot <- covid19_spain %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
group_by(province, week = isoweek(Date)) %>%
summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
filter(week == 15) %>%
left_join(provinces_spain,
by = c("ID_INE")) %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
scale_fill_distiller(name = "Mean Weekly Incidence",
palette = "Reds",
direction = 1,
limits = c(0.755, 988.71)) +
ggtitle("April 6 - April 11") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom",
plot.margin = margin(0, 0, 0, 0, "cm")) +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
# Create grid layout
lay = rbind(c(1, 1, 2, 2),
c(1, 1, 2, 2),
c(3, 3, 4, 4),
c(3, 3, 4, 4),
c(NA, 5, 5, NA),
c(NA, 5, 5, NA))
grid.arrange(week11.plot, week12.plot, week13.plot, week14.plot, week15.plot, layout_matrix = lay)
```
Figure \ref{fig:descriptives-temperature} shows the distribution of the environmental variables in Spain. For ease of visualization we aggregate the provinces by Autonomous Community. Each box-and-whisker in the figure represents the distribution of the variable for a community over the 30-day period. In the plot, the communities have been sorted by latitude, so that Principado de Asturias is the northernmost community, and Andalucia the southernmost. As seen in the figure, there is a relatively wide range of values both within and between provinces over the 30-day period examined. The top panel of the figure shows the distribution of mean temperatures. The lowest mean temperature for a community on any given day was approximately 3°C, and the highest about 20°C, for a range of approximately 17 degrees. Likewise, there is a great deal of variability in humidity, as seen in the middle panel of the figure, where the lowest mean humidity for any community is approximately 48% and the highest is close to 100%. Finally, the bottom panel displays mean daily hours of sunshine in the community. This variable displays the most variability within communities over time, but remains relatively constant across communities. It is important to note that these are summaries by community, and the actual values of these variables for the provinces display somewhat more variability.
```{r descriptives-temperature, echo = FALSE, fig.height=7, fig.cap="\\label{fig:descriptives-temperature} Distribution of mean temperatures and humidities in the Autonomous Communities in Spain between March 12, 2020 and April 11, 2020. The Autonomous Communities have been sorted by latitude, with communities to the left being the northermost, and to the right the southernmost", cache = TRUE}
# Summary of temperature by Autonomous Community
temp.summary <- covid19_spain %>%
filter(CCAA != "Canarias" , ) %>%
group_by(CCAA, Date) %>%
summarize(Value = mean(Mean_Temp)) %>%
mutate(Variable = "Mean Temperature in Community")
# Summary of humidity by Autonomous Community
humidity.summary <- covid19_spain %>%
filter(CCAA != "Canarias") %>%
group_by(CCAA, Date) %>%
summarize(Value = mean(Humidity)) %>%
mutate(Variable = "Mean Humidity in Community")
# Summary of hours of sunshine by Autonomous Community
sunshine.summary <- covid19_spain %>%
filter(CCAA != "Canarias") %>%
group_by(CCAA, Date) %>%
summarize(Value = mean(Sunshine_Hours)) %>%
mutate(Variable = "Mean Daily Hours of Sunshina in Community")
# Bind tables
env.summary <- rbind(temp.summary, humidity.summary, sunshine.summary) %>%
mutate(Variable = factor(Variable,
levels = c("Mean Temperature in Community",
"Mean Humidity in Community",
"Mean Daily Hours of Sunshina in Community")))
# Plot
ggplot(data = env.summary,
aes(x = CCAA,
y = Value)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Autonomous Communities in Spain (sorted from north to south)") +
ylab("Value") +
facet_wrap(~ Variable, nrow = 3,
scales = "free_y")
```
Figure \ref{fig:map-controls} includes three maps that display the spatial variation of our control variables, namely GDP per capita, percentage of older adults in province, population density, and presence of mass transit systems. As seen there, GDP per capita is higher in Madrid and the northeast part of the country, mainly in Pais Vasco and Cataluña. Percentage of older adults is somewhat more checkered, with high values in Madrid and other provinces in the center-west part of the country, but also in some provinces in the north. Outside of provinces with major cities (e.g., Madrid; Bizkaia and Gipuzkoa in Pais Vasco; Pontevedra in Galicia), population density tends to be higher in provinces along the Mediterranean coast. The final panel in the figure shows the five provinces in the country that have higher order mass transit systems (e.g., metro).
```{r map-controls, cache=TRUE, echo=FALSE, fig.height= 9, fig.cap="\\label{fig:map-controls}Spatial distribution of control variables by province", warning=FALSE, cache = TRUE}
# GDP per capita:
q <- quantile(provinces_spain$GDPpc)
gdppc.plot <- provinces_spain %>%
mutate(GDPpc = GDPpc,
GDPpc = case_when(GDPpc < q[2] ~ "1st",
GDPpc > q[2] & GDPpc <= q[3] ~ "2nd",
GDPpc > q[3] & GDPpc <= q[4] ~ "3rd",
GDPpc > q[4] ~ "4th")) %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
ggplot() +
geom_sf(aes(fill = GDPpc), color = "gray69") +
scale_fill_manual(name = "Quartile", values=c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D")) +
ggtitle("GDP per capita") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom") +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
# Older adults:
q <- quantile(provinces_spain$Older)
older.plot <- provinces_spain %>%
mutate(Older = Older,
Older = case_when(Older < q[2] ~ "1st",
Older > q[2] & Older <= q[3] ~ "2nd",
Older > q[3] & Older <= q[4] ~ "3rd",
Older > q[4] ~ "4th")) %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
ggplot() +
geom_sf(aes(fill = Older), color = "gray69") +
scale_fill_manual(name = "Quartile", values=c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D")) +
ggtitle("Percentage of older adults") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom") +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
# Population density:
q <- as.numeric(quantile(provinces_spain$Density))
density.plot <- provinces_spain %>%
mutate(Density = as.numeric(Density),
Density = case_when(Density < q[2] ~ "1st",
Density > q[2] & Density <= q[3] ~ "2nd",
Density > q[3] & Density <= q[4] ~ "3rd",
Density > q[4] ~ "4th")) %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
ggplot() +
geom_sf(aes(fill = Density), color = "gray69") +
scale_fill_manual(name = "Quartile", values=c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D")) +
ggtitle("Population density") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom") +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
# Transit systems:
transit.plot <- provinces_spain %>%
mutate(Transit = factor(Transit,
levels = c(0, 1),
labels = c("No",
"Yes"))) %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
ggplot() +
geom_sf(aes(fill = Transit), color = "gray69") +
scale_fill_manual(name = "Transit", values=c("#FFFEDE", "#D5610D")) +
ggtitle("Mass transit systems") +
theme_bw(base_size = 9) +
theme(axis.text = element_blank(),
legend.position = "bottom") +
geom_sf_text(label = "Madrid",
x = madrid_xy[1] - 4.7,
y = madrid_xy[2] + 0.6,
family = "serif") +
geom_segment(x = madrid_xy[1] - 3.4,
y = madrid_xy[2] + 0.6,
xend = madrid_xy[1],
yend = madrid_xy[2])
# Define layout for plot
lay <- rbind(c(1, 2),
c(3, 4))
grid.arrange(gdppc.plot, older.plot, density.plot, transit.plot, layout_matrix = lay)
```
Figure \ref{fig:daily-correlations} shows the distribution of daily simple correlations of incidence of COVID-19 with the independent variables (with the exception of Transit, which is a categorical variable). These correlations are calculated after log-transforming all variables. As previously discussed, the environmental variables have a temporal lag and were calculated using different time windows.
It can be seen in the figure that temperature (in its three forms) has the highest simple correlation with incidence of COVID-19. After temperature, GDP per capita has the highest positive correlation with the dependent variable. The distribution of these correlations is also quite tight over the 30-day period of the study. Hours of sunshine tends to have a moderately high correlation with incidence of COVID-19, but the distribution of these correlations is more spread, and in some cases strays into negative values. A similar thing happens with humidity, which also tends to display more day to day variation in the correlation with the dependent variable. The percentage of older adults shows a relatively tight distribution of day-to-day correlations, and is negative. Population density, in contrast, tends to be negative, but is relatively spread, and on some days, the simple correlation between density and incidence of COVID-19 is weakly positive. Overall for the period under examination, the pairwise correlations between these variables and incidence are significant at $p<0.05$, with the exception of the three Sunshine_hours variables.
```{r daily-correlations, echo=FALSE, fig.cap="\\label{fig:daily-correlations}Distribution of daily correlations of the independent variables with daily incidence of COVID-19 (all variables have been log-transformed)", cache = TRUE}
covid19_spain %>%
filter(CCAA != "Canarias", province != "Baleares") %>%
group_by(Date) %>%
summarize(GDPpc = cor(log(GDPpc), log(Incidence)),
Older = cor(log(Older), log(Incidence)),
Density = cor(log(Density), log(Incidence)),
Humidity_lag8 = cor(Humidity_lag8, Incidence),
Humidity_lag11 = cor(log(Humidity_lag11), log(Incidence)),
Humidity_lag11w = cor(log(Humidity_lag11w), log(Incidence)),
Mean_Temp_lag8 = cor(log(Mean_Temp_lag8), log(Incidence)),
Mean_Temp_lag11 = cor(log(Mean_Temp_lag11), log(Incidence)),
Mean_Temp_lag11w = cor(log(Mean_Temp_lag11w), log(Incidence)),
Sunshine_Hours_lag8 = cor(log(Sunshine_Hours_lag8 + 0.1), log(Incidence)),
Sunshine_Hours_lag11 = cor(log(Sunshine_Hours_lag11 + 0.1), log(Incidence)),
Sunshine_Hours_lag11w = cor(log(Sunshine_Hours_lag11w + 0.1), log(Incidence))) %>%
select(-Date) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Correlation") %>%
mutate(Variable = factor(Variable,
levels = c("GDPpc",
"Older",
"Density",
"Humidity_lag8",
"Humidity_lag11",
"Humidity_lag11w",
"Mean_Temp_lag8",
"Mean_Temp_lag11",
"Mean_Temp_lag11w",
"Sunshine_Hours_lag8",
"Sunshine_Hours_lag11",
"Sunshine_Hours_lag11w"))) %>%
ggplot(aes(x = Variable, y = Correlation)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
```
```{r correlation-tests, include=FALSE}
cor.test(log(covid19_spain$GDPpc), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Older), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Density), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Humidity_lag8), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Humidity_lag11), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Humidity_lag11w), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Mean_Temp_lag8), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Mean_Temp_lag11), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Mean_Temp_lag11w), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Sunshine_Hours_lag8 + 0.1), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Sunshine_Hours_lag11 + 0.1), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Sunshine_Hours_lag11w + 0.1), log(covid19_spain$Incidence))
```
## SUR Models {#sur-models}
Correlation analysis in the preceding section provides some insights about the potential associations between incidence of COVID-19 and the various environmental and control variables. In this section we estimate three spatial SUR models to test the differences between the various temporal lags and weighting schemes for the environmental variables. Accordingly, we define three models: Model 1, which is estimated using the lagged 8-day averages of the environmental variables (*lag8*); Model 2, which is estimated using the lagged 11-day averages of the environmental variables (*lag11*); and finally, Model 3, which is estimated using the lagged 11-day *weighted* averages of the environmental variables (*lag11w*).
To implement the SUR approach, we must define a matrix of spatial weights $W$. In this case, we consider neighborhoods based on adjacency, based on the well-known queen criterion (two provinces are adjacent if they share a boundary or touch at a vertex). The analysis is of the coterminous provinces\footnote{As a check for robustness, we also tested the rook criterion (two provinces are adjacent if they share a boundary, but not if they only touch at a vertex), and included the three islands in the sample. In this case we made an allowance for adjacency between the two islands in the Autonomous Community of Canarias in the Pacific (Las Palmas and Santa Cruz de Tenerife), and assumed that Islas Baleares in the Mediterranean are adjacent to three provinces in Pais Catalans (i.e., Barcelona, Tarragona, and Castello) The results (which can be consulted in the source R markdown document) are robust to the specification of $W$.}.
For estimation, we log-transform the dependent and quantitative independent variables. The only variable that is not transformed is the categorical variable for transit systems. A log-log transformation is appropriate to capture non-linear relationships between variables and provides a straightforward interpretation of the coefficients as percentage change. Furthermore, we introduce restrictions so that the coefficients of two of our control variables are constant over time, namely GDP per capita and percentage of older adults. We do not see an *a priori* reason to let those two variables vary across equations, and the correlation analysis in Figure \ref{fig:daily-correlations} also suggest little temporal variation. In contrast, we let the spatial autocorrelation parameter, as well as the parameters of the rest of the independent variables (including the constant) to vary over time\footnote{We conducted sensitivity analysis letting all parameters vary over time, and while the results are qualitatively similar, the resulting models are less parsimonious. These results are available in the source R markdown document.}. This will be useful to detect whether there are behavioral adaptations at the population level over the course of the period examined. As an example of behavioral adaptations, the effect of density might weaken over time, in the measure that the effects of the lockdown are felt: at full compliance with the lockdown, with people practicing social avoidance, density might matter less than other factors.
```{r data-preparation-for-modelling, eval = FALSE, include=FALSE, cache = TRUE}
# Organize data for SUR modelling
GPanel <- plm::pdata.frame(covid19_spain %>%
select(province,
Date,
Incidence,
Median_Age,
Male2Female,
Older,
GDPpc,
Density,
Transit,
Mean_Temp_lag8,
Humidity_lag8,
Sunshine_Hours_lag8,
Mean_Temp_lag11,
Humidity_lag11,
Sunshine_Hours_lag11,
Mean_Temp_lag11w,
Humidity_lag11w,
Sunshine_Hours_lag11w),
c("province","Date"))
```
```{r spatial-weights, eval = FALSE, include=FALSE, cache = TRUE}
# Create spatial weights matrix:
Wmat <- provinces_spain %>%
#drop_na() %>%
as("Spatial") %>%
#poly2nb(queen = FALSE) %>%
poly2nb(queen = TRUE) %>%
nb2mat(zero.policy = TRUE)
Wmat <- (Wmat > 0) * 1
# Join the two provinces in Canarias
Wmat[which(provinces_spain$province == "Palmas(Las)"),
which(provinces_spain$province == "Santa Cruz de Tenerife")] <- 1
Wmat[which(provinces_spain$province == "Santa Cruz de Tenerife"),
which(provinces_spain$province == "Palmas(Las)")] <- 1
# 'Paises Catalans'
#n = 8
Wmat[which(provinces_spain$province == "Barcelona"),
which(provinces_spain$province == "Baleares")] <- 1
Wmat[which(provinces_spain$province == "Baleares"),
which(provinces_spain$province == "Barcelona")] <- 1
Wmat[which(provinces_spain$province == "Baleares"),
which(provinces_spain$province == "Castellon/Castello")] <- 1
Wmat[which(provinces_spain$province == "Castellon/Castello"),
which(provinces_spain$province == "Baleares")] <- 1
Wmat[which(provinces_spain$province == "Baleares"),
which(provinces_spain$province == "Tarragona")] <- 1
Wmat[which(provinces_spain$province == "Tarragona"),
which(provinces_spain$province == "Baleares")] <- 1
miW <- Wmat/rowSums(Wmat)
# Convert to listw
listw <- mat2listw(Wmat,style = "W")
```
```{r data-preparation-for-modelling-minus-islands, eval = TRUE, include=FALSE, cache = TRUE}
# Organize data for SUR modelling
GPanel <- plm::pdata.frame(covid19_spain %>%
filter(province != "Baleares", CCAA != "Canarias") %>%
select(province,
Date,
Incidence,
Median_Age,
Male2Female,
Older,
GDPpc,
Density,
Transit,
Mean_Temp_lag8,
Humidity_lag8,
Sunshine_Hours_lag8,
Mean_Temp_lag11,
Humidity_lag11,
Sunshine_Hours_lag11,
Mean_Temp_lag11w,
Humidity_lag11w,
Sunshine_Hours_lag11w),
c("province","Date"))
```
```{r spatial-weights-minus-islands, eval=TRUE, include=FALSE, cache = TRUE}
# Create spatial weights matrix:
Wmat <- provinces_spain %>%
filter(province != "Baleares", CCAA != "Canarias") %>%
#drop_na() %>%
as("Spatial") %>%
poly2nb(queen = TRUE) %>%
nb2mat(zero.policy = TRUE)
#Wmat <- (Wmat > 0) * 1
#miW <- Wmat/rowSums(Wmat)
# Convert to listw
listw <- mat2listw(Wmat,style = "W")
```
```{r formulas, include=FALSE, cache = TRUE}
# Define formulas with three different lagged variables:
formula_lag8 <- log(Incidence) ~
log(GDPpc) +
log(Older) +
log(Density) +
Transit +
log(Humidity_lag8) +
log(Mean_Temp_lag8) +
log(Sunshine_Hours_lag8 + 0.1)
formula_lag11 <- log(Incidence) ~
log(GDPpc) +
log(Older) +
log(Density) +
Transit +
log(Humidity_lag11) +
log(Mean_Temp_lag11) +
log(Sunshine_Hours_lag11 + 0.1)
formula_lag11w <- log(Incidence) ~
log(GDPpc) +
log(Older) +
log(Density) +
Transit +
log(Humidity_lag11w) +
log(Mean_Temp_lag11w) +
log(Sunshine_Hours_lag11w + 0.1)
```
```{r model-restrictions, include=FALSE, cache = TRUE}
T <- max(covid19_spain$Date) - min(covid19_spain$Date) + 1 # Recall that T is the number of days, i.e., time periods, i.e., equations; add 1 to include the starting day
k <- 8 # Number of independent variables, including the constant
coef_rest <- 2 # Number of restrictions
# nrow is number of equations (time periods) minus 1, times the number of restrictions
# ncol is number of variables times number of equations
R2 <- matrix(0, nrow = (T - 1) * coef_rest, ncol = k * T)
for (i in 1:(T-1)){
R2[i, 2] <- 1
R2[i, (2 + i * k)] <- -1
R2[(i + T - 1), 3] <- 1
R2[(i + T - 1), (3 + i * k)] <- -1
# Use if more restrictions are needed
#R2[(i + T - 1) * 2, 4] <- 1
#R2[(i + T - 1) * 2, (4 + i * k)] <- -1
}
b2 <- matrix(0, ncol = (T - 1) * coef_rest)
```
```{r model1-8-day-moving-average, include=FALSE, cache = TRUE}
# Model with a lagged 8-day moving average of climatic variables:
sur.slm_lag8 <- spsur::spsurtime(formula = formula_lag8,
data=GPanel,
time = GPanel$Date,
type = "slm",
fit_method = "3sls",
listw= listw,
R = R2,
b = b2)
#summary(sur.slm_lag8)
#print(paste("Pooled R^2 = ", sur.slm_lag8$R2[1]))
```
```{r model1-8-day-moving-average-all, include=FALSE, eval = FALSE}
# Model with a lagged 8-day moving average of climatic variables - NO RESTRICTIONS TO THE COEFFICIENTS:
sur.slm_lag8_all <- spsur::spsurtime(formula = formula_lag8,
data=GPanel,
time = GPanel$Date,
type = "slm",
fit_method = "3sls",
listw= listw)
```
```{r model2-11-day-moving-average, include=FALSE, cache = TRUE}
# Model with 11-day moving average of climatic variables:
sur.slm_lag11 <- spsur::spsurtime(formula = formula_lag11,
data=GPanel,
time = GPanel$Date,
type = "slm",
fit_method = "3sls",
listw= listw,
R = R2,
b = b2)
#summary(sur.slm_lag11)
#print(paste("Pooled R^2 = ", sur.slm_lag11$R2[1]))
```
```{r model2-11-day-moving-average-all, include=FALSE, cache = TRUE}
# Model with 11-day moving average of climatic variables:
sur.slm_lag11_all <- spsur::spsurtime(formula = formula_lag11,
data=GPanel,
time = GPanel$Date,
type = "slm",
fit_method = "3sls",
listw= listw)
```
```{r model3-11-day-weighted-moving-average, include=FALSE, cache = TRUE}
# Model with 11-day weighted moving average of climatic variables:
sur.slm_lag11w <- spsur::spsurtime(formula = formula_lag11w,
data=GPanel,
time = GPanel$Date,
type = "slm",
fit_method = "3sls",
listw= listw,
R = R2,
b = b2)
#summary(sur.slm_lag11w)
#print(paste("Pooled R^2 = ", sur.slm_lag11w$R2[1]))
```
```{r model3-11-day-weighted-moving-average-all, include=FALSE, cache = TRUE}
# Model with 11-day weighted moving average of climatic variables:
sur.slm_lag11w_all <- spsur::spsurtime(formula = formula_lag11w,
data=GPanel,
time = GPanel$Date,
type = "slm",
fit_method = "3sls",
listw= listw)
```
After estimation, we compare the goodness of fit of the three SUR models. Figure \ref{fig:goodness-of-fit} shows the equation-level coefficient of determination $R^2$, one for each time period/day. As well, the overall coefficient of determination for the system is reported for each model $\text{pooled}-R^2$. The general trend is identical for the three models, with the goodness-of-fit improving over time and plateuing around a value of $R^2$ of 0.55. Model 1 (*lag8*) performs somewhat better in the first few equations/days, when the goodness-of-fit is relatively poor, and then again in the last few equations/days. Model 3 (*lag11w*), in contrast, does not perform well towards the end of the study period. The most balanced model in terms of equation-level goodness-of-fit appears to be Model 1 (*lag8*), and this impression is further supported by a slightly higher value of the $\text{pooled}-R^2$. The analysis using a lagged moving average of the environmental variables is generally in line with the incubation period reported by Lauer et al. [-@Lauer2020incubation], although the results do not support the use of a weighted average. For the remainder of the paper, we will adopt Model 1 (*lag8*) as our best model. In the following section we discuss the results of the analysis in more depth.
```{r goodness-of-fit, cache = TRUE, echo=FALSE, fig.cap="\\label{fig:goodness-of-fit} Goodness of fit of the SUR systems: by date and pooled"}
data.frame(Model1 = sur.slm_lag8$R2,
Model2 = sur.slm_lag11$R2,
Model3 = sur.slm_lag11w$R2) %>%
slice(2:n()) %>%
rownames_to_column(var = "Equation") %>%
mutate(Date = seq(ymd("2020-03-13"),
ymd("2020-04-11"),
by = "days")) %>%
pivot_longer(cols = starts_with("M"), names_to = "Model", values_to = "R2") %>%
mutate(Model = factor(Model,
levels = c("Model1", "Model2", "Model3"),
labels = c("Model 1: lag8", "Model 2: lag11", "Model 3: lag11w"))) %>%
ggplot(aes(x = Date, y = R2, color = Model, shape = Model)) +
geom_point(size = 2) +
scale_color_manual(values = c("Model 1: lag8" = "blue", "Model 2: lag11" = "orange", "Model 3: lag11w" = "green") ) +
annotate(geom = "text",
label = c(paste("pooled-R2 (Model 1: lag8)=",
round(sur.slm_lag8$R2[1], 4)),
paste("pooled-R2 (Model 2: lag11)=",
round(sur.slm_lag11$R2[1], 4)),
paste("pooled-R2 (Model 3: lag11w)=",
round(sur.slm_lag11w$R2[1], 4))),
x = as_date("2020-04-01"),
y = c(0.3, 0.25, 0.20),
family = "serif",
size = 5) +
theme_bw() +
theme(legend.position = "bottom")
```
```{r goodness-of-fit-all, eval=FALSE, echo=FALSE, fig.cap="\\label{fig:goodness-of-fit-all} Goodness of fit of the SUR systems NO RESTRICTION TO THE COEFFICIENTS: by date and pooled"}
data.frame(Model1 = sur.slm_lag8_all$R2,
Model2 = sur.slm_lag11_all$R2,
Model3 = sur.slm_lag11w_all$R2) %>%
slice(2:n()) %>%
rownames_to_column(var = "Equation") %>%
mutate(Date = seq(ymd("2020-03-13"),
ymd("2020-04-11"),
by = "days")) %>%
pivot_longer(cols = starts_with("M"), names_to = "Model", values_to = "R2") %>%
mutate(Model = factor(Model,
levels = c("Model1", "Model2", "Model3"),
labels = c("Model 1: lag8", "Model 2: lag11", "Model 3: lag11w"))) %>%
ggplot(aes(x = Date, y = R2, color = Model, shape = Model)) +
geom_point(size = 2) +
scale_color_manual(values = c("Model 1: lag8" = "blue", "Model 2: lag11" = "orange", "Model 3: lag11w" = "green") ) +
annotate(geom = "text",
label = c(paste("pooled-R2 (Model 1: lag8)=",
round(sur.slm_lag8_all$R2[1], 4)),
paste("pooled-R2 (Model 2: lag11)=",
round(sur.slm_lag11_all$R2[1], 4)),
paste("pooled-R2 (Model 3: lag11w)=",
round(sur.slm_lag11w_all$R2[1], 4))),
x = as_date("2020-04-01"),
y = c(0.3, 0.25, 0.20),
family = "serif",
size = 5) +
theme_bw() +