README.Rmd

---
output: github_document
always_allow_html: true
bibliography: References.bib
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

# A spatio-temporal analysis of the environmental correlates of COVID-19 incidence in Spain

Antonio Paez (McMaster University)  
Fernando A. Lopez (Universidad Politecnica de Cartagena)  
Tatiane Menezes (Universidade Federal de Pernambuco)  
Renata Cavalcanti (Universidade Federal de Pernambuco)  
Maira Galdino da Rocha Pitta (Universidade Federal de Pernambuco)  

**Geographical Analysis (Early View) https://doi.org/10.1111/gean.12241**

**Pre-print available [here](https://github.com/paezha/covid19-environmental-correlates/blob/master/Environmental-Correlates-of-COVID19-Spain/Environmental-Correlates-of-COVID19-Spain.pdf)**

## Abstract

The novel SARS-CoV2 has disrupted health systems and the economy, and public health interventions to slow its spread have been costly. How and when to ease restrictions to movement hinges in part on whether SARS-CoV2 will display seasonality due to variations in temperature, humidity, and hours of sunshine. Here, we address this question by means of a spatio-temporal analysis in Spain of the incidence of  COVID-19, the disease caused by the virus. Use of spatial Seemingly Unrelated Regressions (SUR) allows us to model the incidence of reported cases of the disease per 100,000 population as an interregional contagion process, in addition to a function of temperature, humidity, and sunshine. In the analysis we also control for GDP per capita, percentage of older adults in the population, population density, and presence of mass transit systems. The results support the hypothesis that incidence of the disease is lower at higher temperatures and higher levels of humidity. Sunshine, in contrast, displays a positive association with incidence of the disease. Our control variables also yield interesting insights. Higher incidence is associated with higher GDP per capita and presence of mass transit systems in the province; in contrast, population density and percentage of older adults display negative associations with incidence of COVID-19.   

## Keywords

SARS-CoV2  
COVID-19  
Seasonality  
Temperature  
Humidity  
Population density  
Older adults  
Spatial SUR  
Contagion  
Spain  

```{r load-packages, include=FALSE}
# Download and install these two packages:
# covid19env
# spsur

library(covid19env)
library(ggthemes)
library(gridExtra)
library(kableExtra)
library(lubridate)
library(plm)
library(sf)
library(spdep)
library(spsur)
library(tidyverse)
library(units)
```

```{r load-data, include=FALSE, cache=TRUE}
# Load data from package `covid19env`
data("covid19_spain_1") # Use data sourced from one representative station for the province
#data("covid19_spain_2") # Use data with the average of all stations in provinces
data("provinces_spain")
```

```{r data-preparation, include=FALSE, cache=TRUE}
# Convert GDP per capita to thousands of euros:
provinces_spain <- provinces_spain %>%
  mutate(GDPpc = GDPpc/1000)

# Join provincial data to incidence data and convert to simple features:
covid19_spain <- covid19_spain_1 %>% 
#covid19_spain <- covid19_spain_2 %>% # data with the average of all stations in provinces   
  left_join(provinces_spain %>% st_drop_geometry(),
            by = c("province", "CCAA", "ID_INE"))
```

```{r sort-communities-by-latitude, include=FALSE, cache=TRUE}
# To visualize the distribution of temperature by CCAA, we sort the communities by latitude, from north to south. We do this by community instead of province because the large number of provinces clutters the plots.

# Dissolve internal provincial boundaries to obtain a simple features object with the autonomous communities: 
ccaa.sf <-  provinces_spain %>% 
  group_by(CCAA) %>% 
  summarize(provinces = n())

# Extract coordinates of autonomous communities
ccaa.coords <- ccaa.sf %>%
  st_centroid() %>%
  st_coordinates() %>%
  as.data.frame()

# Join Y coordinate to ccaa.sf
ccaa.sf <- ccaa.sf %>% 
  mutate(long = ccaa.coords$Y)

# Sort autonomous communities from north to south
ccaa.levels <- ccaa.sf %>% 
  arrange(desc(long)) %>% select(CCAA)

ccaa.levels <- as.character(ccaa.levels$CCAA)

# Relevel autonomous communities
covid19_spain <- covid19_spain %>%
  mutate(CCAA = factor(CCAA, levels = ccaa.levels, ordered = TRUE))
```

Introduction {#introduction}
==========================
  
From a small outbreak linked to a live animal market at the end of 2019 to a global pandemic in a matter of weeks, the SARS-CoV2 virus and COVID-19, the disease it causes, have threatened to overrun health systems around the world. In efforts to contain the spread of the disease, numerous governments in many regions and nations have either recommended or mandated social distancing measures, and as of this writing, millions of people in five continents shelter in place. There are encouraging signs that these measures have mitigated the spread of the virus [e.g., @Lancastle2020impact; @Lewnard2020scientific; @Wilder2020isolation]. Even so, this has come at a high cost, and the consequences for all spheres of economic, social, and cultural life have been dire [e.g., @Fernandes2020economic; @Luo2020how]. As a result, there is a sense of urgency to anticipate the progression of the pandemic, in order to plan for progressive lifting of restrictions to movement and social contact [e.g., @Kissler2020projecting]. Needless to say, this has become a delicate, and politically charged, balancing act between public health and the economy [@Gong2020balance].    

A salient question in the debate about how and when to ease social distancing measures is whether the virus will display seasonal variations. Existing research on similar pathogens suggests that the virus could be more stable and potentially easier to transmit in conditions of low temperature and low humidity. While this is encouraging, it is important to keep in mind that "not all seasonal respiratory viruses experience the same spatiotemporal patterns" [@deangel2020weathering, section 4]. This urges caution when extrapolating from known viruses. The evidence in this respect is as yet inconclusive, and although easing restrictions as the weather warms may appear tempting, doing so prematurely could well undo weeks or possibly months of costly measures. 

It is not surprising, given the stakes involved, that this issue has already triggered a lively debate. The current state of knowledge was well-summarized by the National Academy of Sciences, Engineering, and Medicine in the U.S. in a recent report [see @National2020rapid]. Engaged by the Office of the Executive for guidance on this matter, this organization concluded that: "[some] limited data support a potential waning of cases in warmer and more humid seasons, yet none are without major limitations...Additional studies as the SARS-CoV2 pandemic unfolds could shed more light on the effects of climate on transmission" (p. 6). To further complicate matters, much of the relevant work has yet to be peer-reviewed [see for instance the challenge of @Harbert2020spatial; to @Araujo2020spread]. 

With the above considerations in mind, our objective with this paper is to investigate the influence of environmental factors, concretely temperature, humidity, and sunshine, on the progression of the pandemic. We adopt a population health approach, and report results from a spatio-temporal model of the incidence of COVID-19 in the coterminous provinces in Spain, one of the countries hardest hit by the pandemic. We combine data on reported cases of the disease with metereological information, to create a spatio-temporal dataset covering a period of 30 days. We then join this dataset with provincial-level economic and demographic information to act as controls to our key environmental variables. These data are analyzed using a spatial Seemingly Unrelated Regressions (SUR) approach, which allows us to model incidence of COVID-19 as a contagion process. 

The results provide evidence of the effect of temperature, humidity, and sunshine on the incidence of COVID-19. The clearest result with respect to these variables is a lower incidence of COVID-19 at higher temperatures and levels of humidity, while the opposite happens with respect to hours of sunshine. Our control variables also provide some intriguing insights. Higher incidence is associated with higher GDP per capita and presence of mass transit systems in the province; in contrast, population density and percentage of older adults display negative associations with incidence of COVID-19. The results of this analysis provide support to the hypothesis of seasonality of the novel SARS-CoV2, and should be of interest to public health officials and policy makers grappling with the question of the trajectory of the pandemic.

Please note that this paper is prepared as a reproducible research document. The source `R` markdown document, as well as all data and code needed to reproduce/review/extend the analysis can be obtained from a public repository\footnote{\url{https://drive.google.com/drive/u/1/folders/1d_40N_QXo2Fl14r3T3CHs84IED-B1rV_}}.

Background {#background}
============
  
The global emergence of infectious diseases is mostly driven by environmental, ecological, and socio-economic factors [@Jones2008global]. In the case of SARS-CoV2, the ecological factors include the interaction between humans and wildlife. Once transmission of a disease begins to happen between humans, socio-economic and environmental factors become increasingly important. As noted in the introduction, the focus of the paper is on environmental variables, concretely three related to meteorological conditions: temperature, humidity, and sunshine. 

Much of what is known about the potential seasonality of SARS-CoV2 is a result of research on other pathogens. Earlier, diverse studies have shown the effect of temperature and humidity on the incidence of influenza [e.g., @Makinen2009cold; @Jaakkola2014decline; @Kudo2019low]. Jaakkola et al. [-@Jaakkola2014decline], for example, found that a decrease of temperature and absolute humidity precedes the onset of symptoms of influenza A and B viruses by 3 days in places where the temperature is low. After the 2002-2004 outbreak of SARS, researchers also began to investigate the relationship between these factors and SARS-CoV [@Casanova2010effects; @Chan2011effects]. Casanova et al. [-@Casanova2010effects], for instance, used surrogates to find that virus inactivation was likely more rapid at higher temperatures; in terms of humidity, these researchers reported that survival of the virus was lower at moderate relative humidity levels. Chan et al. [-@Chan2011effects] also found that viability of the virus that causes SARS is also lost at higher temperatures (>38°C) and relative humidity superior to 95%.

Whether results from laboratory experiments will hold when the virus circulates in the community remains uncertain. At a global scale, de Ángel Solá et al. [-@deangel2020weathering] see less risk from SARS-CoV2 in the Caribean Basin; on the other hand, Coelho et al. [@Coelho2020exponential] warn that at least during the exponential phase, expansion of the virus is not driven by climate. Similarly, whereas Araujo and Naimi [-@Araujo2020spread] argue that spread of SARS-CoV2 will likely be constrained by climate, Harbert et al. [-@Harbert2020spatial] remain unconvinced that spatial modelling can currently discriminate the distribution of the disease on the basis of climate, at least in the United States. Yao et al. [-@Yao2020association], examined data from China and came to the conclusion that neither temperature nor ultraviolet indices had an association with transmission of COVID-19. This is despite previous research that has linked less exposure to UVB radiation to higher prevalence and severity of acute respiratory tract infections [@Zittermann2016vitamin; @Dąbrowska2018vitamin; @Dinlen2016association; @Mathyssen2017vitamin; @Esposito2015vitamin; @Jat2017vitamin; @Moriyama2020seasonality].

In addition to the environmental variables above, from a population health perspective it is also important to account for potential socio-economic and demographic confounders. 

To account for population-level factors, the first variable that we consider is GDP per capita. Much has been written about globalization and the spread of infectious disease\footnote{As the Globe and Mail, Canada's paper of record, put it in relation to the SARS outbreak in 2003: "Globalization means that if someone in China sneezes, someone in Toronto may one day catch a cold" (Editorial, March 28, 2003, p. A18)}. The growth in global connections has presented a challenge to spatial approaches in the initial stages of disease management, when the cause of a disease may still be unclear but the plane has already departed [@Zhou2016accelerated]. In reference to the earlier outbreak of SARS, van Wagner [-@vanWagner2008toward] chronicles how Toronto's status as a global city turned out to be a vulnerability in this respect. In our case, we think of GDP per capita as a marker of a region's relative position in a network of global cities, and its potential to be further ahead in the trajectory of the pandemic. Furthermore, wealthier regions also tend to concentrate more activities that produce non-traded goods, including building and construction [@Hallet2002regional]. Therefore, it is possible that wealthier regions remain relatively more active even during a lockdown. On the other hand, we cannot discount the possibility that less wealthy regions have a higher proportion of workers in manual occupations who cannot telework, and therefore have more difficulties complying with shelter-in-place orders.

Secondly, we consider percentage of older adults (over 65) in a region. Early evidence regarding COVID-19 suggests that the case rate mortality is higher at older ages [e.g. @Novel2020epidemiological]. However, it is not clear that a relatively large population of older adults necessarily translates into higher transmission rates of the infection. The tool of choice in containing the spread of the disease has been social distancing. In this respect, the evidence from the field of transportation is that older adults tend to travel less frequently, for shorter distances, and have higher rates of immobility than most everyone, except the youngest members of the public [e.g., @Roorda2010trip; @Morency2011distance; @Sikder2012immobility]. In other words, many older adults are, whether by preference or otherwise, already in a form of social isolation. Social distancing during the pandemic may actually reinforce that condition for them, as suggested by the analysis of age-structured social contact in India, China, and Italy of Sing and Adhikari [-@Singh2020age]. Since the age-structured matrix of social contact in Spain is similar to Italy [see @Prem2017projecting], our expectation is that populations with higher percentages of older adults will tend to have lower levels of social contact and hence of incidence.

Population density is also relevant since it directly affects the contact patterns and contact rates between individuals in a population [@Hu2013scaling]. The evidence available suggests a positive relationship between the transmission of COVID-19 and population density (e.g. cumulative incidence in urban areas like NYC). For this reason, we anticipate a positive relationship between population density and the incidence of the disease. 

The last variable that we consider as a control is the presence of mass transit systems in a province. Every province in Spain offers some form of public transportation, but only five provinces have higher order systems of mass mobility (e.g. metro or subway), namely Barcelona, Madrid, Sevilla, Valencia, and Bizkaia. Public transportation has been hypothesized to relate to the spread of contagious disease by some researchers using agent-based approaches and simulation [e.g., @Perez2009agent; @Wang2010gis], and while we find scant evidence of a link in the literature, the idea is intuitively appealing. After all, unlike the isolation that a car offers to travellers, most mass transit system are cauldrons of social contact.

Context and Data {#context-and-data}
===================

## Covid-19 in Spain {#covid-19-in-spain}
  
The first reported case of COVID-19 in Spain was on January 31st, 2020, when a German tourist in the Canary Islands tested positive for the virus. After this case, it was still a few weeks before the first domestic case was reported, on February 27th in Sevilla province (Andalusia). In a short period of time, as testing started to ramp up, it became clear that an outbreak was flaring. By March 11th the World Health Organization (WHO) declared COVID-19 officially a pandemic. This declaration marked a turning point for the public in Spain too. As of March 13th, the number of cases of COVID-19 reported in Spain was 4,473, with a majority of cases (1,990) concentrated in Madrid: these numbers were at the time the worst outbreak in Europe after Italy. In response to the situation, on March 13th the Spanish National Government declared a state of emergency, to go into effect on Saturday March 14th. As part of the state of emergency restrictions to most activities were imposed, with the exception of essential services (e.g. food, health) and some economic subsectors of industry and construction. A few days later, on March 17th, Spain closed its lands borders to allow entry only to returnee nationals and permanent residents. The lockdown was further hardened between March 30th and April 12th (including the Easter weekend of April 10th-12th) and during this period only essential activities were allowed. During this period, there was a dramatic reduction in overall mobility, both within provinces as between \footnote{https://www.mitma.gob.es/ministerio/covid-19/evolucion-movilidad-big-data/movilidad-provincial}.

## Data {#data}

Our dataset includes information about the daily number of cases of COVID-19 reported in Spain at the provincial level (NUTS3 in Eurostat terminology) for the period between March 13th and April 11th, inclusive. For our purposes, we consider positive cases reported, but exclude symptomatic cases diagnosed by a doctor without a Polymerase Chain Reaction (PCR) test. The Spanish National Government publishes periodic updates at the regional level (NUTS2) and the information is also released at the provincial level as part of a collaborative project by geovoluntarios.com\footnote{https://www.geovoluntarios.org/}, ProvidencialData19\footnote{https://www.datoscovid.es/pages/providencialdata19}, and ESRI España. This information is compiled from several sources, mainly the official web pages of the Spanish regional goverments, as documented in Centro de Datos Covid-19\footnote{https://www.datoscovid.es/pages/sobre-la-iniciativa}. We consider two sets of explanatory variables. The first one, and the focus of this research, are the three environmental variables, collected from official sources (i.g., *AEMET*, the state meteorology agency, and *MAPA*, the ministry of agriculture, fisheries, and food). The second set provides some relevant controls as discussed above, and are also collected from official sources, i.e., INE, the national statistics institute. Table \ref{tab:descriptive-statistics} shows the descriptive statistics and the provenance of the data used in this research.

The spatial and temporal coverage of the data is as follows. Our dataset begins on March 13, which is the first date when every province had reported at least one case of COVID-19, and continues until April 11, for a period of 30 days. The unit of analysis is the province. Provinces are the equivalent of states, and are embedded in Autonomous Communities. As an example, Cataluña is an Autonomous Community and consists of four provinces, namely Barcelona, Gerona, Lerida, and Tarragona. The size of the provinces is relatively large, as seen in Table \ref{tab:descriptive-statistics}. The smallest province is $1,978.12km^2$ (this is smaller than Rhode Island in the U.S.) and the largest province is $21,767.93km^2$ (slightly smaller than New Jersey in the U.S.). While this is a relatively large degree of spatial aggregation, reporting on COVID-19 is not yet consistent at smaller geographies, or cases are not reported at that level at all.

```{r descriptive-statistics, echo=FALSE, cache=TRUE}
data.frame(Variable = c("COVID-19 Incidence",
                        "Area", 
                        "GDPpc", 
                        "Older", 
                        "Population Density",
                        "Mean Temperature",
                        "Humidity",
                        "Sunshine"),
           Note = c("Incidence in reported cases of SARS-19 per 100,000 people",
                    "Area of province in sq.km",
                    "GDP per capita in €1,000s",
                    "Percentage of people aged 65 and older in the province",
                    "Population density in the province in people per sq.km",
                    "Mean temperature in province by date, in °C",
                    "Relative humidity in province by date",
                    "Daily hours of sunshine in province by date"),
           Min = c(min(covid19_spain$Incidence),
                   min(provinces_spain %>% 
                         st_area() %>%
                         set_units(km^2)),
                   min(provinces_spain$GDPpc),
                   min(provinces_spain$Older),
                   min(provinces_spain$Density),
                   min(covid19_spain$Mean_Temp),
                   min(covid19_spain$Humidity),
                   min(covid19_spain$Sunshine_Hours)),
           Mean = c(mean(covid19_spain$Incidence),
                    mean(provinces_spain %>% 
                         st_area() %>%
                         set_units(km^2)),
                    mean(provinces_spain$GDPpc),
                    mean(provinces_spain$Older),
                    mean(provinces_spain$Density),
                    mean(covid19_spain$Mean_Temp),
                    mean(covid19_spain$Humidity),
                    mean(covid19_spain$Sunshine_Hours)),
           Max = c(max(covid19_spain$Incidence),
                   max(provinces_spain %>% 
                         st_area() %>%
                         set_units(km^2)),
                   max(provinces_spain$GDPpc),
                   max(provinces_spain$Older),
                   max(provinces_spain$Density),
                   max(covid19_spain$Mean_Temp),
                   max(covid19_spain$Humidity),
                   max(covid19_spain$Sunshine_Hours)),
           SD = c(sd(covid19_spain$Incidence),
                  sd(provinces_spain$GDPpc),
                  sd(provinces_spain %>% 
                         st_area() %>%
                         set_units(km^2)),
                  sd(provinces_spain$Older),
                  sd(provinces_spain$Density),
                  sd(covid19_spain$Mean_Temp),
                  sd(covid19_spain$Humidity),
                  sd(covid19_spain$Sunshine_Hours)),
           Source = c("ProvidencialData19",
                      "INE",
                      "INE",
                      "INE",
                      "INE",
                      "AEMET",
                      "MAPA",
                      "AEMET")) %>%
  kable(#"latex",
        "html",
        booktabs = TRUE,
        digits = 2,
        caption = "\\label{tab:descriptive-statistics} Descriptive statistics") %>%
  #kable_styling(latex_options = c("striped", "scale_down")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed")) %>%
  column_spec(2, width = "15em") %>%
  column_spec(7, width = "5em") %>%
  footnote(general = c("ProvidencialData19: https://www.datoscovid.es/pages/providencialdata19", 
                       "INE (Instituto Nacional de Estadistica): https://www.ine.es/",
                       "AEMET (Agencia Estatal de Meteorologia): http://eportal.mapa.gob.es",
                       "MAPA (Ministerio de Agricultura, Pesca y Alimentacion): http://eportal.mapa.gob.es"))
```

An important aspect of working with environmental data such as temperature, humidity, and hours of sunshine, is the incubation period of the disease. Lauer et al. [-@Lauer2020incubation] report for the case of COVID-19 a median incubation period of 5.7 days (with a confidence interval between 4.9 to 7.8 days). The vast majority of cases (95%) develop symptoms between 2.6 days (CI, 2.1 to 3.7 days) and 12.5 days (CI, 8.2 to 17.7 days). For this reason, we judge it best to use lagged values of the environmental variables. We test different time lags as follows. We consider a lagged 8-day average, from date-minus-12 to date-minus-5 days (hereafter *lag8*). Secondly, we consider a lagged 11-day average, from date-minus-12 to date-minus-2 days (hereafter *lag11*). Finally, to account for the likely duration of incubation, we consider a lagged 11-day *weighted* average, from date-minus-12 to date-minus-2 days (hereafter *lag11w*). The weights for this variable are calculated using the parameters of the log-normal distribution reported by Lauer et al. [-@Lauer2020incubation], i.e., a log-mean of 1.621 and a log-standard deviation of 0.418. With these weights, the environmental variables at date-minus-2 and date-minus-12 days are weighted as 0.041 and 0.009 respectively, whereas the environmental variables at date-minus-5 days are weighted as 0.194. These weights have the effect of changing the contribution of daily values to the lagged moving average. For instance, the temperature at date-minus-4-days is weighted more heavily than the temperature at date-minus-10-days, as a closer approximation of the conditions at the most likely time of contagion before the disease became manifest.

Methods: the Spatial SUR Model {#methods}
===================

The Seemingly Unrelated Regression equations model (SUR hereafter) is a multivariate econometric model used in different fields when the structure of the data consists of cross-sections for different time periods. The basis of this approach is well-known since the initial works of Zellner [-@Zellner1962efficient], and has become a popular methodology included in several econometrics textbook [e.g., @Greene2003econometric]. To our knowledge, Anselin [-@Anselin1988spatial] was the first author to discuss SUR from a spatial perspective, in the context of spatio-temporal analysis. In his landmark text, Anselin discussed a model made of "an equation for each time period, which is estimated for a cross section of spatial units" (p. 141). From this milestone, a large body of research has developed to extend the classical SUR into a spatial framework [e.g., @Rey1999us; @Lauridsen2010spatiotemporal; @Legallo2006evaluating; @Lopez2017spatial].

The classical SUR model without spatial effects (from here, SUR-SIM) is a stack of equations as follows:
  
<!--
\begin{equation}
\label{eq:sur-sim}
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_T
\end{bmatrix}
=
  \begin{bmatrix}
X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_T
\end{bmatrix}
\
\begin{bmatrix}
\beta_1 \\ \beta_1 \\ \vdots \\ \beta_T
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T
\end{bmatrix}
\end{equation}

\noindent where $y_{t}=(y_{1t},...,y_{Nt})$ is a $N \times 1$ vector, and in our case $y_{st}$ is the incidence ratio in the province $s$ ($s=1,...,N$) the day $t$ $(t=1,...,T)$; $X_t$ is a $N \times k_t$ matrix of the $k_t$ independent variables, with associated vector of coefficients $\beta_t$,; $\beta_t=(\beta_{1t},...,\beta_{Nt})$ is a vector of coefficients and $\epsilon_t=(\epsilon_{1t},...,\epsilon_{Nt})$ is the vector of residuals.
--> 

<!--Equation for README.md generated using this app https://alexanderrodin.com/github-latex-markdown/ -->
![\begin{equation} \label{eq:sur-sim} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_T \end{bmatrix} \ \begin{bmatrix} \beta_1 \\ \beta_1 \\ \vdots \\ \beta_T \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T \end{bmatrix} \end{equation}](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20%5Clabel%7Beq%3Asur-sim%7D%20%5Cbegin%7Bbmatrix%7D%20y_1%20%5C%5C%20y_2%20%5C%5C%20%5Cvdots%20%5C%5C%20y_T%20%5Cend%7Bbmatrix%7D%20%3D%20%5Cbegin%7Bbmatrix%7D%20X_1%20%26%200%20%26%20%5Ccdots%20%26%200%20%5C%5C%200%20%26%20X_2%20%26%20%5Ccdots%20%26%200%20%5C%5C%20%5Cvdots%20%26%20%5Cvdots%20%26%20%5Cddots%20%26%20%5Cvdots%20%5C%5C%200%20%26%200%20%26%20%5Ccdots%20%26%20X_T%20%5Cend%7Bbmatrix%7D%20%5C%20%5Cbegin%7Bbmatrix%7D%20%5Cbeta_1%20%5C%5C%20%5Cbeta_1%20%5C%5C%20%5Cvdots%20%5C%5C%20%5Cbeta_T%20%5Cend%7Bbmatrix%7D%20%2B%20%5Cbegin%7Bbmatrix%7D%20%5Cepsilon_1%20%5C%5C%20%5Cepsilon_2%20%5C%5C%20%5Cvdots%20%5C%5C%20%5Cepsilon_T%20%5Cend%7Bbmatrix%7D%20%5Cend%7Bequation%7D)
  
\noindent where ![$y_{t}=(y_{1t},...,y_{Nt})$](https://render.githubusercontent.com/render/math?math=%24y_%7Bt%7D%3D(y_%7B1t%7D%2C...%2Cy_%7BNt%7D)%24) is an ![$N \times 1$](https://render.githubusercontent.com/render/math?math=%24N%20%5Ctimes%201%24) vector, and in our case ![$y_{st}$](https://render.githubusercontent.com/render/math?math=%24y_%7Bst%7D%24) is the incidence ratio in the province ![$s\text{ }(s=1,...,N)$](https://render.githubusercontent.com/render/math?math=%24s%5Ctext%7B%20%7D(s%3D1%2C...%2CN)%24) the day ![$t\text{ }(t=1,...,T)$](https://render.githubusercontent.com/render/math?math=%24t%5Ctext%7B%20%7D(t%3D1%2C...%2CT)%24); ![$X_t=(X^1,...,X^{k_t})$](https://render.githubusercontent.com/render/math?math=%24X_t%3D(X%5E1%2C...%2CX%5E%7Bk_t%7D)%24) is a ![$N \times k_t$](https://render.githubusercontent.com/render/math?math=%24N%20%5Ctimes%20k_t%24) matrix of the ![$k_t$](https://render.githubusercontent.com/render/math?math=%24k_t%24) independent variables, ![$X_i=(X^i_{st})$](https://render.githubusercontent.com/render/math?math=%24X_i%3D(X%5Ei_%7Bst%7D)%24); ![$\beta_t=(\beta_{1t},...,\beta_{Nt})$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta_t%3D(%5Cbeta_%7B1t%7D%2C...%2C%5Cbeta_%7BNt%7D)%24) is a vector of coefficients and ![$\epsilon_t=(\epsilon_{1t},...,\epsilon_{Nt})$](https://render.githubusercontent.com/render/math?math=%24%5Cepsilon_t%3D(%5Cepsilon_%7B1t%7D%2C...%2C%5Cepsilon_%7BNt%7D)%24) is the vector of residuals.

A key feature of the SUR model is the temporal dependence structure among the vectors of residuals, namely:
  
<!--
\begin{equation}
\label{eq:sur-err}
E[\epsilon_t \epsilon'_{t'}]=\sigma_{tt'}
\end{equation} 

Note that this specification is very flexible, in that it allows changes in the coefficients $\beta_{it}$ in order to modulate the effect of the independent variables on $y_t$. This flexibility can be reduced and it is posible to impose restrictions considering some $\beta$ coefficients as being constant over time. In this case, we can reformulate the coefficients expression $\beta_t = (\beta_{1}, \cdots, \beta_{r-1}, \beta_{r}, \beta_{r+1}, \cdots, \beta_{Nt})$ to restrict the first $r$ coefficients to be constant across equations. This is equivalent to specifying some effects to be invariant over time.
-->

![\begin{equation} \label{eq:sur-err} E\[\epsilon_t \epsilon'_{t'}\]=\sigma_{tt'} \end{equation} ](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20%5Clabel%7Beq%3Asur-err%7D%20E%5B%5Cepsilon_t%20%5Cepsilon'_%7Bt'%7D%5D%3D%5Csigma_%7Btt'%7D%20%5Cend%7Bequation%7D%20)

Note that this specification is very flexible, in that it allows changes in the coefficients ![$\beta_{it}$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta_%7Bit%7D%24) in order to modulate the effect of the independent variables on ![$y_t$](https://render.githubusercontent.com/render/math?math=%24y_t%24). This flexibility can be reduced and it is posible to impose restrictions considering some ![$\beta$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta%24) coefficients as being constant over time. In this case, we can reformulate the coefficients expression ![$\beta_t = (\beta_{1}, \cdots, \beta_{r-1}, \beta_{r}, \beta_{r+1}, \cdots, \beta_{Nt})$](https://render.githubusercontent.com/render/math?math=%24%5Cbeta_t%20%3D%20(%5Cbeta_%7B1%7D%2C%20%5Ccdots%2C%20%5Cbeta_%7Br-1%7D%2C%20%5Cbeta_%7Br%7D%2C%20%5Cbeta_%7Br%2B1%7D%2C%20%5Ccdots%2C%20%5Cbeta_%7BNt%7D)%24) to restrict the first ![$r$](https://render.githubusercontent.com/render/math?math=%24r%24) coefficients to be constant across equations. This is equivalent to specifying some effects to be invariant over time.

Equation (\ref{eq:sur-sim}) can be rewriten in compact form:
  
<!--
\begin{equation}
\bf{Y} = \bf{X} \beta + \epsilon
\label{eq:sur-sim-block}
\end{equation}

\noindent where $\bf{Y}$ is now a vector of dimension $NT \times 1$, $\bf{X}$ is a block-diagonal matrix $NT \times K$ 
(with $K = \sum_t{k_t}$) and $\epsilon$ is an $NT \times 1$ vector. Using the Kronecker product notation ($\otimes$), the error matrix structure is expressed concisely as:

\begin{equation}
E[\epsilon \epsilon']=\bf{\Sigma} \otimes \bf{I_N} ; \ \bf{\Sigma}=(\sigma_{tt'})
\end{equation}
-->

![\begin{equation} y = X \beta + \epsilon ![E\[\epsilon \epsilon'\]=\bf{\Sigma} \otimes \bf{I_N} ; \ \bf{\Sigma}=(\sigma_{tt'})](https://render.githubusercontent.com/render/math?math=E%5B%5Cepsilon%20%5Cepsilon'%5D%3D%5Cbf%7B%5CSigma%7D%20%5Cotimes%20%5Cbf%7BI_N%7D%20%3B%20%5C%20%5Cbf%7B%5CSigma%7D%3D(%5Csigma_%7Btt'%7D))
  
\noindent where ![$\bf{Y}$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7BY%7D%24) is now a vector of dimension ![$NT \times 1$](https://render.githubusercontent.com/render/math?math=%24NT%20%5Ctimes%201%24), ![$\bf{X}$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7BX%7D%24) is a block-diagonal matrix ![$NT \times K$](https://render.githubusercontent.com/render/math?math=%24NT%20%5Ctimes%20K%24) 
(with ![$K = \sum_t{k_t}$](https://render.githubusercontent.com/render/math?math=%24K%20%3D%20%5Csum_t%7Bk_t%7D%24)) and ![$\epsilon$](https://render.githubusercontent.com/render/math?math=%24%5Cepsilon%24) is an ![$NT \times 1$](https://render.githubusercontent.com/render/math?math=%24NT%20%5Ctimes%201%24) vector. Using the Kronecker product notation (![$\otimes$](https://render.githubusercontent.com/render/math?math=%24%5Cotimes%24)), the error matrix structure is expressed concisely as:

![\begin{equation} E\[\epsilon \epsilon'\]=\Sigma \otimes I_N ; \ \Sigma=(\sigma_{tt'}) \end{equation}](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20E%5B%5Cepsilon%20%5Cepsilon'%5D%3D%5CSigma%20%5Cotimes%20I_N%20%3B%20%5C%20%5CSigma%3D(%5Csigma_%7Btt'%7D)%20%5Cend%7Bequation%7D)
  
As is the case with cross-sectional data, it is possible to test the residuals of Model (\ref{eq:sur-sim-block}) for spatial autocorrelation, and several tests have been developed to test the null hypothesis of spatial independence [see @Lopez2014spatial]. When the null hypothesis is rejected, several alternative specifications have been proposed to include spatial effects [@Anselin1988spatial; see also @Anselin2016estimation]. In this paper we consider a spatial SUR model that incorporates a spatial lag of the dependent variable as an explanatory factor. Spatial analytical approaches were used to understand contagion-difussion processes in the case of infectious disease in general [e.g., @Cliff1998detecting] and the 2003-2004 SARS outbreak in particular [e.g., @Meng2005understanding; @Cao2010spatio]. While we are mindful of the same caveat that the novel SARS-CoV2 may not follow the patterns of previous diseases, there is still evidence from the United States that COVID-19 displays spatial patterns that are consistent with a diffusion process [@Desjardins2020rapid]. For this reason, the spatial lag model is appropriate to model incidence of COVID-19 geographically, since it accounts for potential spatial patterns that result from a process of contagion, as explained next. 

The stack expresion for the SUR model with a spatially lagged dependent variable (SUR-SLM) is as follows:

<!--
\begin{equation}
\label{eq:sur-slm}
\begin{aligned}
\bf{AY} = \bf{X} \beta + \epsilon \\
\epsilon =N(0,\bf{\Sigma})
\end{aligned}
\end{equation}

\noindent where $\bf{A} =I_{TN}-\bf{\Gamma} \otimes W$ is the spatially lagged dependent variable, and $\bf{\Gamma} = diag(\rho_1, \cdots, \rho_T)$.

This specification assumes that outcome ($y_{st}$) at location $s$ and time $t$ is partially determined by the weighted average ($Wy_{st}$) of the outcome in neighboring provinces, with neighborhood defined by matrix $W$ of spatial weights. In other words, the spatially lagged dependent variable represents a process of contagion, where the disease in neighboring provinces can spillover in a spatial way. The coefficients of the spatially lagged variable are estimated for each time period $\rho_t$ and identify the intensity and the sign of the contagion effect. It is possible test the null hypothesis of identical levels of spatial dependence ($\rho_i=\rho_j, \forall i,j$). The correspond Wald test is available in the `R` package `spsur`. 
--> 
                                                                                                                
![\begin{equation} \label{eq:sur-slm} \begin{aligned} \bf{AY} = \bf{X} \beta + \epsilon \\ \epsilon =N(0,\bf{\Sigma}) \end{aligned} \end{equation}](https://render.githubusercontent.com/render/math?math=%5Cbegin%7Bequation%7D%20%5Clabel%7Beq%3Asur-slm%7D%20%5Cbegin%7Baligned%7D%20%5Cbf%7BAY%7D%20%3D%20%5Cbf%7BX%7D%20%5Cbeta%20%2B%20%5Cepsilon%20%5C%5C%20%5Cepsilon%20%3DN(0%2C%5Cbf%7B%5CSigma%7D)%20%5Cend%7Baligned%7D%20%5Cend%7Bequation%7D)

\noindent where ![$\bf{A} =I_{TN}-\bf{\Gamma} \otimes W$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7BA%7D%20%3DI_%7BTN%7D-%5Cbf%7B%5CGamma%7D%20%5Cotimes%20W%24) is the spatially lagged dependent variable, and ![$\bf{\Gamma} = diag(\rho_1, \cdots, \rho_T)$](https://render.githubusercontent.com/render/math?math=%24%5Cbf%7B%5CGamma%7D%20%3D%20diag(%5Crho_1%2C%20%5Ccdots%2C%20%5Crho_T)%24).

This specification assumes that outcome (![$y_{st}$](https://render.githubusercontent.com/render/math?math=%24y_%7Bst%7D%24)) at location ![$s$](https://render.githubusercontent.com/render/math?math=%24s%24) and time ![$t$](https://render.githubusercontent.com/render/math?math=%24t%24) is partially determined by the weighted average (![$Wy_{st}$](https://render.githubusercontent.com/render/math?math=%24Wy_%7Bst%7D%24)) of the outcome in neighboring provinces, with neighborhood defined by matrix ![$W$](https://render.githubusercontent.com/render/math?math=%24W%24) of spatial weights. In other words, the spatially lagged dependent variable represents a process of contagion, where the disease in neighboring provinces can spillover in a spatial way. The coefficients of the spatially lagged variable are estimated for each time period ![$\rho_t$](https://render.githubusercontent.com/render/math?math=%24%5Crho_t%24) and identify the intensity and the sign of the contagion effect. It is possible test the null hypothesis of identical levels of spatial dependence (![$\rho_i=\rho_j, \forall i,j$](https://render.githubusercontent.com/render/math?math=%24%5Crho_i%3D%5Crho_j%2C%20%5Cforall%20i%2Cj%24)). The correspond Wald test is available in the `R` package `spsur`.
  
The SUR-SLM model can be estimated using maximum likelihood (@Lopez2014spatial) or instrumental variables (@Minguez2019).

A note regarding the interpretation of the model is in order. It is well-known that coefficients in linear regression models are partial derivatives of the dependent variable with respect to the independent variables, and therefore directly give the marginal effects or rates of change. Spatially lagged models, however, are no longer linear. The intuition behind the non-linearity is that the spatial lag expands the information set to include information from neighbouring regions: in other words, the value of an explanatory variable in a spatial unit can have influence in other spatial units via the spatial lag. This makes interpretation of the coefficients less straightforward but also richer [@Golgher2016interpret]. The results of LeSage [-@LeSage2009introduction] for cross-sectional spatial lag models can be extended to the spatial SUR framework. Note that, according to Elhorst [-@Elhorst2014spatial], the partial derivatives have the following interpretation: if an explanatory variable ($X_k$) in a particular province changes, not only the incidence rate in that province changes, also incidence rates in other provinces change via the contagion effect. Therefore, a change in $X_k$ in a particular province has a *direct effect* on that province, but also an *indirect effect* on neighbouring provinces. In this way, the $i$th diagonal element of the matrix of partial derivatives represents the direct effect on the $i$th province, whereas the non-diagonal elements of $i$th row of the matrix of partial derivatives represent the indirect effects on other provinces. In order to obtain a global indicator, the direct effect is calculated as the mean of the diagonal elements and captures the average change in incidence ratio caused by the change in $X_k$. Likewise, a global indicator of the indirect effects is calculated as the mean of the non-diagonal elements. The total effect is the sum of direct and indirect effects. Finally, note that if $\rho_k = 0$, the indirect effects are 0 and the direct effects are equal to $\beta_kt$.

Results {#results}
===================
  
## Exploratory Data Analysis {#eda} 
  
Figure \ref{fig:weekly-average-incidence-map} shows the geographical variation in the incidence of COVID-19 in Spain, as well as the temporal progression of the disease in weekly averages. Our analysis begins on March 13. Albeit still low, the highest incidence at this early date was in the provice of Álava, in the North of Spain. Álava is not the most populous province, with a population of only 331,549, but it has the highest GDP per capita of all provinces. Vitoria, its main city, is the capital of the Basque Country and has been the focus of efforts, along with San Sebastian and Bilbao, to develop a "Global Basque City" [@Meijers2008strategic]. The other early focus of the pandemic in Spain was Madrid, which is the most populous province in the country and has the second highest GDP per capita after Álava. The early outbreaks in these two provinces can be traced throughout the progression of the pandemic over time, although by the end of the period under study, other provinces had matched and even surpased their incidence rates, including Segovia and Soria to the north of Madrid, and Ciudad Real and Albacete to the south.

```{r weekly-average-incidence-map, echo=FALSE, fig.height= 9, fig.cap="\\label{fig:weekly-average-incidence-map}Mean weekly incidence of COVID-19 by province, in reported cases by 100,000 people", warning=FALSE, cache = TRUE}
# Retrieve the coordinates of Madrid
madrid_xy <-  st_centroid(provinces_spain %>% filter(province == "Madrid")) %>% st_coordinates()


week11.plot <- covid19_spain %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  group_by(province, week = isoweek(Date)) %>% 
  summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
  filter(week == 11) %>%
  left_join(provinces_spain,
            by = c("ID_INE")) %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
  scale_fill_distiller(name = "Mean Weekly Incidence", 
                       palette = "Reds", 
                       direction = 1,
                       limits = c(0.755, 988.71)) +
  ggtitle("March 13 - March 15") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom",
        plot.margin = margin(0, 0, 0, 0, "cm")) +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

week12.plot <- covid19_spain %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  group_by(province, week = isoweek(Date)) %>% 
  summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
  filter(week == 12) %>%
  left_join(provinces_spain,
            by = c("ID_INE")) %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
  scale_fill_distiller(name = "Mean Weekly Incidence", 
                       palette = "Reds",
                       direction = 1,
                       limits = c(0.755, 988.71)) +
  ggtitle("March 16 - March 22") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom",
        plot.margin = margin(0, 0, 0, 0, "cm")) +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6) +
  geom_segment(x = madrid_xy[1] - 3.4,
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

week13.plot <- covid19_spain %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  group_by(province, week = isoweek(Date)) %>% 
  summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
  filter(week == 13) %>%
  left_join(provinces_spain,
            by = c("ID_INE")) %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
  scale_fill_distiller(name = "Mean Weekly Incidence", 
                       palette = "Reds", 
                       direction = 1,
                       limits = c(0.755, 988.71)) +
  ggtitle("March 23 - March 29") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom",
        plot.margin = margin(0, 0, 0, 0, "cm")) +
  geom_sf_text(label = "Madrid",
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

week14.plot <- covid19_spain %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  group_by(province, week = isoweek(Date)) %>% 
  summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
  filter(week == 14) %>%
  left_join(provinces_spain,
            by = c("ID_INE")) %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
  scale_fill_distiller(name = "Mean Weekly Incidence", 
                       palette = "Reds", 
                       direction = 1,
                       limits = c(0.755, 988.71)) +
  ggtitle("March 30 - April 5") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom",
        plot.margin = margin(0, 0, 0, 0, "cm")) +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

week15.plot <- covid19_spain %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  group_by(province, week = isoweek(Date)) %>% 
  summarise(ID_INE = first(ID_INE), mean_weekly_incidence = mean(Incidence)) %>%
  filter(week == 15) %>%
  left_join(provinces_spain,
            by = c("ID_INE")) %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(aes(fill = mean_weekly_incidence), color = "gray69") +
  scale_fill_distiller(name = "Mean Weekly Incidence", 
                       palette = "Reds", 
                       direction = 1,
                       limits = c(0.755, 988.71)) +
  ggtitle("April 6 - April 11") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom",
        plot.margin = margin(0, 0, 0, 0, "cm")) +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

# Create grid layout
lay = rbind(c(1, 1, 2, 2),
        c(1, 1, 2, 2),
        c(3, 3, 4, 4),
        c(3, 3, 4, 4),
        c(NA, 5, 5, NA),
        c(NA, 5, 5, NA))

grid.arrange(week11.plot, week12.plot, week13.plot, week14.plot, week15.plot, layout_matrix = lay)
```

Figure \ref{fig:descriptives-temperature} shows the distribution of the environmental variables in Spain. For ease of visualization we aggregate the provinces by Autonomous Community. Each box-and-whisker in the figure represents the distribution of the variable for a community over the 30-day period. In the plot, the communities have been sorted by latitude, so that Principado de Asturias is the northernmost community, and Andalucia the southernmost. As seen in the figure, there is a relatively wide range of values both within and between provinces over the 30-day period examined. The top panel of the figure shows the distribution of mean temperatures. The lowest mean temperature for a community on any given day was approximately 3°C, and the highest about 20°C, for a range of approximately 17 degrees. Likewise, there is a great deal of variability in humidity, as seen in the middle panel of the figure, where the lowest mean humidity for any community is approximately 48% and the highest is close to 100%. Finally, the bottom panel displays mean daily hours of sunshine in the community. This variable displays the most variability within communities over time, but remains relatively constant across communities. It is important to note that these are summaries by community, and the actual values of these variables for the provinces display somewhat more variability.

```{r descriptives-temperature, echo = FALSE, fig.height=7, fig.cap="\\label{fig:descriptives-temperature} Distribution of mean temperatures and humidities in the Autonomous Communities in Spain between March 12, 2020 and April 11, 2020. The Autonomous Communities have been sorted by latitude, with communities to the left being the northermost, and to the right the southernmost", cache = TRUE}
# Summary of temperature by Autonomous Community
temp.summary <- covid19_spain %>% 
  filter(CCAA != "Canarias" , ) %>%
  group_by(CCAA, Date) %>%
  summarize(Value = mean(Mean_Temp)) %>%
  mutate(Variable = "Mean Temperature in Community")

# Summary of humidity by Autonomous Community
humidity.summary <- covid19_spain %>% 
  filter(CCAA != "Canarias") %>%
  group_by(CCAA, Date) %>%
  summarize(Value = mean(Humidity)) %>%
  mutate(Variable = "Mean Humidity in Community")

# Summary of hours of sunshine by Autonomous Community
sunshine.summary <- covid19_spain %>% 
  filter(CCAA != "Canarias") %>%
  group_by(CCAA, Date) %>%
  summarize(Value = mean(Sunshine_Hours)) %>%
  mutate(Variable = "Mean Daily Hours of Sunshina in Community")

# Bind tables
env.summary <- rbind(temp.summary, humidity.summary, sunshine.summary) %>%
  mutate(Variable = factor(Variable, 
                           levels = c("Mean Temperature in Community",
                                      "Mean Humidity in Community",
                                      "Mean Daily Hours of Sunshina in Community")))

# Plot
ggplot(data = env.summary, 
       aes(x = CCAA,
           y = Value)) +
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Autonomous Communities in Spain (sorted from north to south)") +
  ylab("Value") +
  facet_wrap(~ Variable, nrow = 3, 
             scales = "free_y")
```

Figure \ref{fig:map-controls} includes three maps that display the spatial variation of our control variables, namely GDP per capita, percentage of older adults in province, population density, and presence of mass transit systems. As seen there, GDP per capita is higher in Madrid and the northeast part of the country, mainly in Pais Vasco and Cataluña. Percentage of older adults is somewhat more checkered, with high values in Madrid and other provinces in the center-west part of the country, but also in some provinces in the north. Outside of provinces with major cities (e.g., Madrid; Bizkaia and Gipuzkoa in Pais Vasco; Pontevedra in Galicia), population density tends to be higher in provinces along the Mediterranean coast. The final panel in the figure shows the five provinces in the country that have higher order mass transit systems (e.g., metro).

```{r map-controls, cache=TRUE, echo=FALSE, fig.height= 9, fig.cap="\\label{fig:map-controls}Spatial distribution of control variables by province", warning=FALSE, cache = TRUE}
# GDP per capita:
q <- quantile(provinces_spain$GDPpc)
gdppc.plot <- provinces_spain %>%
  mutate(GDPpc = GDPpc,
         GDPpc = case_when(GDPpc < q[2] ~ "1st",
                             GDPpc > q[2] & GDPpc <= q[3]  ~ "2nd",
                             GDPpc > q[3] & GDPpc <= q[4] ~ "3rd",
                             GDPpc > q[4] ~ "4th")) %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  ggplot() +
  geom_sf(aes(fill = GDPpc), color = "gray69") +
  scale_fill_manual(name = "Quartile", values=c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D")) +
  ggtitle("GDP per capita") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom") +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

# Older adults:
q <- quantile(provinces_spain$Older)
older.plot <- provinces_spain %>% 
  mutate(Older = Older,
         Older = case_when(Older < q[2] ~ "1st",
                             Older > q[2] & Older <= q[3]  ~ "2nd",
                             Older > q[3] & Older <= q[4] ~ "3rd",
                             Older > q[4] ~ "4th")) %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  ggplot() +
  geom_sf(aes(fill = Older), color = "gray69") +
  scale_fill_manual(name = "Quartile", values=c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D")) +
  ggtitle("Percentage of older adults") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom") +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

# Population density:
q <- as.numeric(quantile(provinces_spain$Density))
density.plot <- provinces_spain %>% 
  mutate(Density = as.numeric(Density),
         Density = case_when(Density < q[2] ~ "1st",
                             Density > q[2] & Density <= q[3]  ~ "2nd",
                             Density > q[3] & Density <= q[4] ~ "3rd",
                             Density > q[4] ~ "4th")) %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  ggplot() +
  geom_sf(aes(fill = Density), color = "gray69") +
  scale_fill_manual(name = "Quartile", values=c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D")) +
  ggtitle("Population density") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom") +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

# Transit systems:
transit.plot <- provinces_spain %>% 
  mutate(Transit = factor(Transit, 
                          levels = c(0, 1),
                          labels = c("No",
                                     "Yes"))) %>%
  filter(CCAA != "Canarias", province != "Baleares") %>%
  ggplot() +
  geom_sf(aes(fill = Transit), color = "gray69") +
  scale_fill_manual(name = "Transit", values=c("#FFFEDE", "#D5610D")) +
  ggtitle("Mass transit systems") +
  theme_bw(base_size = 9) +
  theme(axis.text = element_blank(),
        legend.position = "bottom") +
  geom_sf_text(label = "Madrid", 
               x = madrid_xy[1] - 4.7, 
               y = madrid_xy[2] + 0.6,
               family = "serif") +
  geom_segment(x = madrid_xy[1] - 3.4, 
               y = madrid_xy[2] + 0.6, 
               xend = madrid_xy[1], 
               yend = madrid_xy[2])

# Define layout for plot
lay <- rbind(c(1, 2),
             c(3, 4))

grid.arrange(gdppc.plot, older.plot, density.plot, transit.plot, layout_matrix = lay)
```

Figure \ref{fig:daily-correlations} shows the distribution of daily simple correlations of incidence of COVID-19 with the independent variables (with the exception of Transit, which is a categorical variable). These correlations are calculated after log-transforming all variables. As previously discussed, the environmental variables have a temporal lag and were calculated using different time windows. 

It can be seen in the figure that temperature (in its three forms) has the highest simple correlation with incidence of COVID-19. After temperature, GDP per capita has the highest positive correlation with the dependent variable. The distribution of these correlations is also quite tight over the 30-day period of the study. Hours of sunshine tends to have a moderately high correlation with incidence of COVID-19, but the distribution of these correlations is more spread, and in some cases strays into negative values. A similar thing happens with humidity, which also tends to display more day to day variation in the correlation with the dependent variable. The percentage of older adults shows a relatively tight distribution of day-to-day correlations, and is negative. Population density, in contrast, tends to be negative, but is relatively spread, and on some days, the simple correlation between density and incidence of COVID-19 is weakly positive. Overall for the period under examination, the pairwise correlations  between these variables and incidence are significant at $p<0.05$, with the exception of the three Sunshine_hours variables.

```{r daily-correlations, echo=FALSE, fig.cap="\\label{fig:daily-correlations}Distribution of daily correlations of the independent variables with daily incidence of COVID-19 (all variables have been log-transformed)", cache = TRUE}
covid19_spain %>% 
  filter(CCAA != "Canarias", province != "Baleares") %>%
  group_by(Date) %>%
  summarize(GDPpc = cor(log(GDPpc), log(Incidence)),
            Older = cor(log(Older), log(Incidence)),
            Density = cor(log(Density), log(Incidence)),
            Humidity_lag8 = cor(Humidity_lag8, Incidence),
            Humidity_lag11 = cor(log(Humidity_lag11), log(Incidence)),
            Humidity_lag11w = cor(log(Humidity_lag11w), log(Incidence)),
            Mean_Temp_lag8 = cor(log(Mean_Temp_lag8), log(Incidence)),
            Mean_Temp_lag11 = cor(log(Mean_Temp_lag11), log(Incidence)),
            Mean_Temp_lag11w = cor(log(Mean_Temp_lag11w), log(Incidence)),
            Sunshine_Hours_lag8 = cor(log(Sunshine_Hours_lag8 + 0.1), log(Incidence)),
            Sunshine_Hours_lag11 = cor(log(Sunshine_Hours_lag11 + 0.1), log(Incidence)),
            Sunshine_Hours_lag11w = cor(log(Sunshine_Hours_lag11w + 0.1), log(Incidence))) %>%
  select(-Date) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Correlation") %>%
  mutate(Variable = factor(Variable,
                           levels = c("GDPpc", 
                                      "Older", 
                                      "Density", 
                                      "Humidity_lag8",
                                      "Humidity_lag11",
                                      "Humidity_lag11w",
                                      "Mean_Temp_lag8",
                                      "Mean_Temp_lag11",
                                      "Mean_Temp_lag11w",
                                      "Sunshine_Hours_lag8",
                                      "Sunshine_Hours_lag11",
                                      "Sunshine_Hours_lag11w"))) %>%
  ggplot(aes(x = Variable, y = Correlation)) +
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))
```

```{r correlation-tests, include=FALSE}
cor.test(log(covid19_spain$GDPpc), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Older), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Density), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Humidity_lag8), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Humidity_lag11), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Humidity_lag11w), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Mean_Temp_lag8), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Mean_Temp_lag11), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Mean_Temp_lag11w), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Sunshine_Hours_lag8 + 0.1), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Sunshine_Hours_lag11 + 0.1), log(covid19_spain$Incidence))
cor.test(log(covid19_spain$Sunshine_Hours_lag11w + 0.1), log(covid19_spain$Incidence))
```

## SUR Models {#sur-models}

Correlation analysis in the preceding section provides some insights about the potential associations between incidence of COVID-19 and the various environmental and control variables. In this section we estimate three spatial SUR models to test the differences between the various temporal lags and weighting schemes for the environmental variables. Accordingly, we define three models: Model 1, which is estimated using the lagged 8-day averages of the environmental variables (*lag8*); Model 2, which is estimated using the lagged 11-day averages of the environmental variables (*lag11*); and finally, Model 3, which is estimated using the lagged 11-day *weighted* averages of the environmental variables (*lag11w*).

To implement the SUR approach, we must define a matrix of spatial weights $W$. In this case, we consider neighborhoods based on adjacency, based on the well-known queen criterion (two provinces are adjacent if they share a boundary or touch at a vertex). The analysis is of the coterminous provinces\footnote{As a check for robustness, we also tested the rook criterion (two provinces are adjacent if they share a boundary, but not if they only touch at a vertex), and included the three islands in the sample. In this case we made an allowance for adjacency between the two islands in the Autonomous Community of Canarias in the Pacific (Las Palmas and Santa Cruz de Tenerife), and assumed that Islas Baleares in the Mediterranean are adjacent to three provinces in Pais Catalans (i.e., Barcelona, Tarragona, and Castello) The results (which can be consulted in the source R markdown document) are robust to the specification of $W$.}.

For estimation, we log-transform the dependent and quantitative independent variables. The only variable that is not transformed is the categorical variable for transit systems. A log-log transformation is appropriate to capture non-linear relationships between variables and provides a straightforward interpretation of the coefficients as percentage change. Furthermore, we introduce restrictions so that the coefficients of two of our control variables are constant over time, namely GDP per capita and percentage of older adults. We do not see an  *a priori* reason to let those two variables vary across equations, and the correlation analysis in Figure \ref{fig:daily-correlations} also suggest little temporal variation. In contrast, we let the spatial autocorrelation parameter, as well as the parameters of the rest of the independent variables (including the constant) to vary over time\footnote{We conducted sensitivity analysis letting all parameters vary over time, and while the results are qualitatively similar, the resulting models are less parsimonious. These results are available in the source R markdown document.}. This will be useful to detect whether there are behavioral adaptations at the population level over the course of the period examined. As an example of behavioral adaptations, the effect of density might weaken over time, in the measure that the effects of the lockdown are felt: at full compliance with the lockdown, with people practicing social avoidance, density might matter less than other factors.

```{r data-preparation-for-modelling, eval = FALSE, include=FALSE, cache = TRUE}
# Organize data for SUR modelling 
GPanel <- plm::pdata.frame(covid19_spain %>%
                             select(province, 
                                    Date,
                                    Incidence, 
                                    Median_Age,
                                    Male2Female,
                                    Older,
                                    GDPpc,
                                    Density,
                                    Transit,
                                    Mean_Temp_lag8,
                                    Humidity_lag8,
                                    Sunshine_Hours_lag8,
                                    Mean_Temp_lag11,
                                    Humidity_lag11,
                                    Sunshine_Hours_lag11,
                                    Mean_Temp_lag11w,
                                    Humidity_lag11w,
                                    Sunshine_Hours_lag11w),
                           c("province","Date"))

```

```{r spatial-weights, eval = FALSE, include=FALSE, cache = TRUE}
# Create spatial weights matrix: 
Wmat <- provinces_spain %>%
  #drop_na() %>%
  as("Spatial") %>%
  #poly2nb(queen = FALSE) %>%
  poly2nb(queen = TRUE) %>%
  nb2mat(zero.policy = TRUE)

Wmat <- (Wmat > 0) * 1

# Join the two provinces in Canarias
Wmat[which(provinces_spain$province == "Palmas(Las)"), 
     which(provinces_spain$province == "Santa Cruz de Tenerife")] <- 1
Wmat[which(provinces_spain$province == "Santa Cruz de Tenerife"), 
     which(provinces_spain$province == "Palmas(Las)")] <- 1

# 'Paises Catalans'
#n = 8
Wmat[which(provinces_spain$province == "Barcelona"), 
     which(provinces_spain$province == "Baleares")] <- 1
Wmat[which(provinces_spain$province == "Baleares"), 
     which(provinces_spain$province == "Barcelona")] <- 1
Wmat[which(provinces_spain$province == "Baleares"), 
     which(provinces_spain$province == "Castellon/Castello")] <- 1 
Wmat[which(provinces_spain$province == "Castellon/Castello"), 
     which(provinces_spain$province == "Baleares")] <- 1
Wmat[which(provinces_spain$province == "Baleares"), 
     which(provinces_spain$province == "Tarragona")] <- 1 
Wmat[which(provinces_spain$province == "Tarragona"), 
     which(provinces_spain$province == "Baleares")] <- 1
miW <- Wmat/rowSums(Wmat)

# Convert to listw
listw <- mat2listw(Wmat,style = "W")
```

```{r data-preparation-for-modelling-minus-islands, eval = TRUE, include=FALSE, cache = TRUE}
# Organize data for SUR modelling 
GPanel <- plm::pdata.frame(covid19_spain %>%
                             filter(province != "Baleares", CCAA != "Canarias") %>%
                             select(province, 
                                    Date,
                                    Incidence, 
                                    Median_Age,
                                    Male2Female,
                                    Older,
                                    GDPpc,
                                    Density,
                                    Transit,
                                    Mean_Temp_lag8,
                                    Humidity_lag8,
                                    Sunshine_Hours_lag8,
                                    Mean_Temp_lag11,
                                    Humidity_lag11,
                                    Sunshine_Hours_lag11,
                                    Mean_Temp_lag11w,
                                    Humidity_lag11w,
                                    Sunshine_Hours_lag11w),
                           c("province","Date"))

```

```{r spatial-weights-minus-islands, eval=TRUE, include=FALSE, cache = TRUE}
# Create spatial weights matrix: 
Wmat <- provinces_spain %>%
  filter(province != "Baleares", CCAA != "Canarias") %>%
  #drop_na() %>%
  as("Spatial") %>%
  poly2nb(queen = TRUE) %>%
  nb2mat(zero.policy = TRUE)

#Wmat <- (Wmat > 0) * 1

#miW <- Wmat/rowSums(Wmat)

# Convert to listw
listw <- mat2listw(Wmat,style = "W")
```

```{r formulas, include=FALSE, cache = TRUE}
# Define formulas with three different lagged variables:
formula_lag8 <- log(Incidence) ~ 
  log(GDPpc) +
  log(Older) +
  log(Density) +
  Transit +
  log(Humidity_lag8) +
  log(Mean_Temp_lag8) +
  log(Sunshine_Hours_lag8 + 0.1)

formula_lag11 <- log(Incidence) ~ 
  log(GDPpc) +
  log(Older) +
  log(Density) +
  Transit +
  log(Humidity_lag11) +
  log(Mean_Temp_lag11) +
  log(Sunshine_Hours_lag11 + 0.1)

formula_lag11w <- log(Incidence) ~ 
  log(GDPpc) +
  log(Older) +
  log(Density) +
  Transit +
  log(Humidity_lag11w) +
  log(Mean_Temp_lag11w) +
  log(Sunshine_Hours_lag11w + 0.1)
```
```{r model-restrictions, include=FALSE, cache = TRUE}
T <- max(covid19_spain$Date) - min(covid19_spain$Date) + 1 # Recall that T is the number of days, i.e., time periods, i.e., equations; add 1 to include the starting day
k <- 8 # Number of independent variables, including the constant
coef_rest <- 2 # Number of restrictions

# nrow is number of equations (time periods) minus 1, times the number of restrictions
# ncol is number of variables times number of equations
R2 <- matrix(0, nrow = (T - 1) * coef_rest, ncol = k * T)

for (i in 1:(T-1)){
  R2[i, 2] <- 1
  R2[i, (2 + i * k)] <- -1
  R2[(i + T - 1), 3] <- 1
  R2[(i + T - 1), (3 + i * k)] <- -1
  # Use if more restrictions are needed
  #R2[(i + T - 1) * 2, 4] <- 1
  #R2[(i + T - 1) * 2, (4 + i * k)] <- -1
}
b2 <- matrix(0, ncol = (T - 1) * coef_rest)
```

```{r model1-8-day-moving-average, include=FALSE, cache = TRUE}
# Model with a lagged 8-day moving average of climatic variables:
sur.slm_lag8 <- spsur::spsurtime(formula = formula_lag8, 
                                 data=GPanel, 
                                 time = GPanel$Date, 
                                 type = "slm", 
                                 fit_method = "3sls", 
                                 listw=  listw,
                                 R = R2,
                                 b = b2)
#summary(sur.slm_lag8)
#print(paste("Pooled R^2 = ", sur.slm_lag8$R2[1]))
```

```{r model1-8-day-moving-average-all, include=FALSE, eval = FALSE}
# Model with a lagged 8-day moving average of climatic variables - NO RESTRICTIONS TO THE COEFFICIENTS:
sur.slm_lag8_all <- spsur::spsurtime(formula = formula_lag8, 
                                 data=GPanel, 
                                 time = GPanel$Date, 
                                 type = "slm", 
                                 fit_method = "3sls", 
                                 listw=  listw)
```

```{r model2-11-day-moving-average, include=FALSE, cache = TRUE}
# Model with 11-day moving average of climatic variables:
sur.slm_lag11 <- spsur::spsurtime(formula = formula_lag11, 
                                  data=GPanel, 
                                  time = GPanel$Date, 
                                  type = "slm", 
                                  fit_method = "3sls", 
                                  listw=  listw,
                                  R = R2,
                                  b = b2)
#summary(sur.slm_lag11)
#print(paste("Pooled R^2 = ", sur.slm_lag11$R2[1]))
```

```{r model2-11-day-moving-average-all, include=FALSE, cache = TRUE}
# Model with 11-day moving average of climatic variables:
sur.slm_lag11_all <- spsur::spsurtime(formula = formula_lag11, 
                                  data=GPanel, 
                                  time = GPanel$Date, 
                                  type = "slm", 
                                  fit_method = "3sls", 
                                  listw=  listw)
```

```{r model3-11-day-weighted-moving-average, include=FALSE, cache = TRUE}
# Model with 11-day weighted moving average of climatic variables:
sur.slm_lag11w <- spsur::spsurtime(formula = formula_lag11w, 
                                   data=GPanel, 
                                   time = GPanel$Date, 
                                   type = "slm", 
                                   fit_method = "3sls", 
                                   listw=  listw,
                                   R = R2,
                                   b = b2)
#summary(sur.slm_lag11w)
#print(paste("Pooled R^2 = ", sur.slm_lag11w$R2[1]))
```

```{r model3-11-day-weighted-moving-average-all, include=FALSE, cache = TRUE}
# Model with 11-day weighted moving average of climatic variables:
sur.slm_lag11w_all <- spsur::spsurtime(formula = formula_lag11w, 
                                   data=GPanel, 
                                   time = GPanel$Date, 
                                   type = "slm", 
                                   fit_method = "3sls", 
                                   listw=  listw)
```

After estimation, we compare the goodness of fit of the three SUR models. Figure \ref{fig:goodness-of-fit} shows the equation-level coefficient of determination $R^2$, one for each time period/day. As well, the overall coefficient of determination for the system is reported for each model $\text{pooled}-R^2$. The general trend is identical for the three models, with the goodness-of-fit improving over time and plateuing around a value of $R^2$ of 0.55. Model 1 (*lag8*) performs somewhat better in the first few equations/days, when the goodness-of-fit is relatively poor, and then again in the last few equations/days. Model 3 (*lag11w*), in contrast, does not perform well towards the end of the study period. The most balanced model in terms of equation-level goodness-of-fit appears to be Model 1 (*lag8*), and this impression is further supported by a slightly higher value of the $\text{pooled}-R^2$. The analysis using a lagged moving average of the environmental variables is generally in line with the incubation period reported by Lauer et al. [-@Lauer2020incubation], although the results do not support the use of a weighted average. For the remainder of the paper, we will adopt Model 1 (*lag8*) as our best model. In the following section we discuss the results of the analysis in more depth.

```{r goodness-of-fit, cache = TRUE, echo=FALSE, fig.cap="\\label{fig:goodness-of-fit} Goodness of fit of the SUR systems: by date and pooled"}
data.frame(Model1 = sur.slm_lag8$R2, 
           Model2 = sur.slm_lag11$R2,
           Model3 = sur.slm_lag11w$R2) %>%
  slice(2:n()) %>%
  rownames_to_column(var = "Equation") %>%
  mutate(Date = seq(ymd("2020-03-13"), 
                    ymd("2020-04-11"), 
                    by = "days")) %>%
  pivot_longer(cols = starts_with("M"), names_to = "Model", values_to = "R2") %>%
  mutate(Model = factor(Model, 
                        levels = c("Model1", "Model2", "Model3"),
                        labels = c("Model 1: lag8", "Model 2: lag11", "Model 3: lag11w"))) %>%
  ggplot(aes(x = Date, y = R2, color = Model, shape = Model)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("Model 1: lag8" = "blue", "Model 2: lag11" = "orange", "Model 3: lag11w" = "green") ) + 
  annotate(geom = "text", 
           label = c(paste("pooled-R2 (Model 1: lag8)=", 
                           round(sur.slm_lag8$R2[1], 4)),
                     paste("pooled-R2 (Model 2: lag11)=", 
                           round(sur.slm_lag11$R2[1], 4)),
                     paste("pooled-R2 (Model 3: lag11w)=", 
                           round(sur.slm_lag11w$R2[1], 4))),
           x = as_date("2020-04-01"),
           y = c(0.3, 0.25, 0.20),
           family = "serif",
           size = 5) +
  theme_bw() +
  theme(legend.position = "bottom")
```

```{r goodness-of-fit-all, eval=FALSE, echo=FALSE, fig.cap="\\label{fig:goodness-of-fit-all} Goodness of fit of the SUR systems NO RESTRICTION TO THE COEFFICIENTS: by date and pooled"}
data.frame(Model1 = sur.slm_lag8_all$R2, 
           Model2 = sur.slm_lag11_all$R2,
           Model3 = sur.slm_lag11w_all$R2) %>%
  slice(2:n()) %>%
  rownames_to_column(var = "Equation") %>%
  mutate(Date = seq(ymd("2020-03-13"), 
                    ymd("2020-04-11"), 
                    by = "days")) %>%
  pivot_longer(cols = starts_with("M"), names_to = "Model", values_to = "R2") %>%
  mutate(Model = factor(Model, 
                        levels = c("Model1", "Model2", "Model3"),
                        labels = c("Model 1: lag8", "Model 2: lag11", "Model 3: lag11w"))) %>%
  ggplot(aes(x = Date, y = R2, color = Model, shape = Model)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("Model 1: lag8" = "blue", "Model 2: lag11" = "orange", "Model 3: lag11w" = "green") ) + 
  annotate(geom = "text", 
           label = c(paste("pooled-R2 (Model 1: lag8)=", 
                           round(sur.slm_lag8_all$R2[1], 4)),
                     paste("pooled-R2 (Model 2: lag11)=", 
                           round(sur.slm_lag11_all$R2[1], 4)),
                     paste("pooled-R2 (Model 3: lag11w)=", 
                           round(sur.slm_lag11w_all$R2[1], 4))),
           x = as_date("2020-04-01"),
           y = c(0.3, 0.25, 0.20),
           family = "serif",
           size = 5) +
  theme_bw() +
  theme(legend.position = "bottom")
```

Discussion {#discussion}
===================

Table \ref{tab:summary-best-model} presents a summary of the results of Model 1 (*lag8*). Recall that two coefficients were constrained and are estimated only for the first equation of the system, and thus are assumed to be constant over time. These are the coefficients corresponding to GDP per capita ($p\leq0.10$) and percentage of older adults ($p\leq0.05$). The sign of the coefficient for GDP per capita is positive, which indicates that wealthier regions tend to have a higher incidence of COVID-19. This is in line with the idea that the epidemic started earlier in wealthier places due to their connections to a globalized world. The sign of the coefficient for percentage of older adults, on the other hand, is negative. As previously discussed, the level of social contact of older adults even under normal circumstances tends to be lower than for younger people. As a consequence, places with larger populations of older adults appear to have a natural level of social distancing in place. It is important to note that this does not detract from evidence that older adults are more vulnerable individually and in institutional settings, where their case mortality rates are perhaps the highest of all age groups. Instead, this result indicates that their presence in the community at large tends to depress transmission of the virus.

Of the two other control variables, the coefficient of population density is significant at $p\leq0.05$ in 12, at $p\leq0.10$ in 3 equations, and not significant in 15. The coefficient for transit is significant at $p\leq0.10$ in 20 equations, and of those, significant at $p\leq0.05$ in 18 equations. The next four variables are environmental factors. The coefficient for humidity is significant at $p\leq0.19$ in 20 equations, and of those, significant at $p\leq0.05$ in 15 equations. Of the environmental variables, temperature is the only variable that has significant coefficients in every equation at $p\leq0.05$. Finally, sunshine has significant coefficients at $p\leq0.05$ in 24 equations.

```{r summary-best-model, echo=FALSE, cache = TRUE}
# Copy best model
best_model <- sur.slm_lag8

data.frame(Coefficient = c(best_model$coefficients, 
                           best_model$deltas), 
           sd = c(best_model$rest.se, 
                  best_model$deltas.se)) %>%
  rownames_to_column(var = "Variable") %>%
  mutate(zscore = Coefficient/sd,
         pval = 2 * pnorm(-abs(zscore)),
         Variable = c("Intercept", 
                      "log(GDPpc)", 
                      "log(Older)",
                      "log(Density)",
                      "Transit",
                      "log(Humidity)",
                      "log(Temperature)",
                      "log(Sunshine)",
                      rep(c("Intercept",
                            "log(Density)", 
                            "Transit", 
                            "log(Humidity)", 
                            "log(Temperature)",
                            "log(Sunshine)"),
                          29),
                      rep("Spatially lagged y (rho)",
                          30)),
         Variable = factor(Variable, 
                           levels = c("Intercept", 
                                      "log(GDPpc)", 
                                      "log(Older)", 
                                      "log(Density)",
                                      "Transit", 
                                      "log(Humidity)",
                                      "log(Temperature)",
                                      "log(Sunshine)",
                                      "Spatially lagged y (rho)"))) %>%
  group_by(Variable) %>%
  summarize(Min = min(Coefficient), 
            Mean = mean(Coefficient), 
            Max = max(Coefficient),
            pns = sum(pval > 0.10), # Number of coefficients with p > 0.1
            p10 = sum(pval <= 0.10) - sum(pval <= 0.05), # Number of coefficients with  0.1 <= pval < 0.05
            p05 = sum(pval <= 0.05)) %>% # Number of coefficients with pval <=5 
  mutate(Note = c("Non-constrained", 
                  "Constrained", 
                  "Constrained", 
                  "Non-constrained", 
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained")) %>%
  kable(#"latex",
        "html",
        booktabs = TRUE,
        digits = 3,
        col.names = c("Variable", "Min", "Mean", "Max", "p > 0.10", "0.10 <= p < 0.05", "p <= 0.05", "Note"),
        caption = "\\label{tab:summary-best-model}Summary of estimation results of best model (lag11: lagged 11-day moving average)") %>%
  #kable_styling(latex_options = c("striped", "scale_down")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed")) %>%
  add_header_above(c(" " = 1, "Estimates" = 3, "Significance" = 3, " " = 1)) %>%
  footnote(general = c("Significance: This is the number of coefficients with p-values as indicated",
                       "Non-constrained: coefficient was allowed to vary across equations",
                       "Constrained: coefficient as constant across equations"))
```

```{r summary-best-model-all, include=FALSE, eval=FALSE}
best_model_all <- sur.slm_lag11

data.frame(Coefficient = c(best_model_all$coefficients, 
                           # Copy best model
                           best_model_all$deltas), 
           sd = c(best_model_all$rest.se, 
                  best_model_all$deltas.se)) %>%
  rownames_to_column(var = "Variable") %>%
  mutate(zscore = Coefficient/sd,
         pval = 2 * pnorm(-abs(zscore)),
         Variable = c(rep(c("Intercept", 
                            "log(GDPpc)", 
                            "log(Older)",
                            "log(Density)",
                            "Transit",
                            "log(Humidity)",
                            "log(Temperature)",
                            "log(Sunshine)"),
                          30),
                      rep("Spatially lagged y (rho)",
                          30)),
         Variable = factor(Variable, 
                           levels = c("Intercept", 
                                      "log(GDPpc)", 
                                      "log(Older)", 
                                      "log(Density)",
                                      "Transit", 
                                      "log(Humidity)",
                                      "log(Temperature)",
                                      "log(Sunshine)",
                                      "Spatially lagged y (rho)"))) %>%
  group_by(Variable) %>%
  summarize(Min = min(Coefficient), 
            Mean = mean(Coefficient), 
            Max = max(Coefficient),
            pns = sum(pval > 0.10), # Number of coefficients with p > 0.1
            p10 = sum(pval <= 0.10) - sum(pval <= 0.05), # Number of coefficients with  0.1 <= pval < 0.05
            p05 = sum(pval <= 0.05)) %>% # Number of coefficients with pval <=5 
  mutate(Note = c("Non-constrained", 
                  "Constrained", 
                  "Constrained", 
                  "Non-constrained", 
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained")) %>%
  kable(#"latex",
    "html",
    booktabs = TRUE,
    digits = 3,
    col.names = c("Variable", "Min", "Mean", "Max", "p > 0.10", "0.10 <= p < 0.05", "p <= 0.05", "Note"),
    caption = "\\label{tab:summary-best-model}Summary of estimation results of best model (lag11: lagged 11-day moving average)") %>%
  #kable_styling(latex_options = c("striped", "scale_down")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed")) %>%
  add_header_above(c(" " = 1, "Estimates" = 3, "Significance" = 3, " " = 1)) %>%
  footnote(general = c("Significance: This is the number of coefficients with p-values as indicated",
                       "Non-constrained: coefficient was allowed to vary across equations",
                       "Constrained: coefficient as constant across equations"))
```

```{r function-for-plotting-coefficients, include=FALSE}
# Define a function for plotting the coefficients over time including their confidence intervals and significance
plot_coeff <- function(model = model, n = n){
  data.frame(Date = seq(ymd("2020-03-13"), 
                        ymd("2020-04-11"), 
                        by = "days"),
             Beta = matrix(model$coefficients[-c(2,3)],ncol = T)[n,],
             sd = matrix(model$rest.se[-c(2,3)], ncol = T)[n,]) %>%
    mutate(zscore = Beta/sd,
           pval = 2 * pnorm(-abs(zscore)),
           CI95low = Beta - qnorm(0.975, mean = 0, sd = 1) * sd,
           CI95up = Beta + qnorm(0.975, mean = 0, sd = 1) * sd,
           CI90low = Beta - qnorm(0.95, mean = 0, sd = 1) * sd,
           CI90up = Beta + qnorm(0.95, mean = 0, sd = 1) * sd,
           pval = case_when(pval <= 0.05 ~ "p <= 0.05",
                            pval > 0.05 & pval <= 0.10 ~ "0.05 < p <= 0.10",
                            TRUE ~ "p > 0.10")) %>%
    ggplot(aes(x = Date, y = Beta, color = pval)) +
    geom_rect(xmin = as_date("2020-03-14"), xmax = as_date("2020-03-30"),
              ymin = -Inf, ymax = Inf, alpha = 0.010, fill = "yellow", color = NA) +
    geom_rect(xmin = as_date("2020-03-30"), xmax = as_date("2020-04-12"),
              ymin = -Inf, ymax = Inf, alpha = 0.010, fill = "orange", color = NA) +
    geom_rect(xmin = as_date("2020-04-12"), xmax = as_date("2020-04-30"),
              ymin = -Inf, ymax = Inf, alpha = 0.010, fill = "yellow", color = NA) +
  geom_point(size = 1) +
    geom_errorbar(aes(x = Date,
                      ymin = CI90low,
                      ymax = CI90up), 
                  size = 0.6,
                  width = 0.8) +
    geom_linerange(aes(x = Date,
                       ymin = CI95low,
                       ymax = CI95up),
                   size = 0.4) +
    scale_color_manual(values = c("0.05 < p <= 0.10" = "orange", "p <= 0.05" = "red", "p > 0.10" = "gray")) +  
    geom_hline(yintercept = 0) +
    theme_bw()
  }
```
  
```{r function-extract-legend, include=FALSE}
# Define a function to extract a legend from a ggplot
#extract legend
#https://github.com/hadley/ggplot2/wiki/Share-a-legend-between-two-ggplot2-graphs
g_legend<-function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)}
```

To better understand the results, we proceed to plot the coefficients in their temporal sequence. At this point it is worth recalling that the state of emergency went into effect on March 14. In the following figures, the periods of time indicated in yellow starting on March 14 correspond to the state of emergency, with only essential travel and selected industrial activities allowed; the period of time in orange was the stricter lockdown when only essential travel was allowed. 

We begin our discussion with the evolution of the spatial autocorrelation coefficient ($\rho$) in Figure \ref{fig:results-time-1} (left panel). We notice that the magnitude of the spatial autocorrelation coefficient $\rho$ declines over the period under analysis, and is not significant for some days. This is an interesting result: immediately prior to the declaration of the state of emergency, there appears to have been a strong inter-provincial contagion effect. Keeping in mind that the incubation period ranges between 2 and 12 days with a median of 5, it is reasonable to expect that the effect of the lockdown will be observed with some delay. Indeed, as seen in the figure, the autocorrelation coefficient remains high for several days, then begins to decline around March 23, and continues to weaken over time. At the end of the period under examination, the strength of this effect is much diminished and we would expect that under full compliance with strict lockdown conditions (meaning no inter-provincial mobility) the spatial contagion effect would be zero - as seems to be the case.

The intercept (right panel in Figure \ref{fig:results-time-1}) is indicative of the variation of the incidence of COVID-19, other things being equal. Here we see that at the incidence declines somewhat immediately after the state of emergency, only to begin increasing again over time. Then, the incidence declines again after the stricter lockdown and rebounds to a lower level by April 11.

```{r results-time-1, cache = TRUE, fig.height= 4, echo=FALSE, fig.cap="\\label{fig:results-time-1}Temporal evolution of the spatial autocorrelation coefficient (rho) and the intercept of the model; dots are the point estimates and vertical lines are 95% confidence intervals. In yellow is the period after the declaration of the state of emergency, and in orange is the period when only essential activities were allowed."}
rho.plot <- data.frame(Date = seq(ymd("2020-03-13"), 
                      ymd("2020-04-11"), 
                      by = "days"),
           Beta = best_model$deltas,
           sd = best_model$deltas.se) %>%
  mutate(zscore = Beta/sd,
         pval = 2 * pnorm(-abs(zscore)),
           CI95low = Beta - qnorm(0.975, mean = 0, sd = 1) * sd,
           CI95up = Beta + qnorm(0.975, mean = 0, sd = 1) * sd,
           CI90low = Beta - qnorm(0.95, mean = 0, sd = 1) * sd,
           CI90up = Beta + qnorm(0.95, mean = 0, sd = 1) * sd,
         pval = case_when(pval <= 0.05 ~ "p <= 0.05",
                          pval > 0.05 & pval <= 0.10 ~ "0.05 < p <= 0.10",
                         TRUE ~ "p > 0.10")) %>%
  ggplot(aes(x = Date, y = Beta, color = pval)) +
  geom_rect(xmin = as_date("2020-03-14"), xmax = as_date("2020-03-30"),
              ymin = -Inf, ymax = Inf, alpha = 0.010, fill = "yellow", color = NA) +
  geom_rect(xmin = as_date("2020-03-30"), xmax = as_date("2020-04-12"),
            ymin = -Inf, ymax = Inf, alpha = 0.010, fill = "orange", color = NA) +
  geom_rect(xmin = as_date("2020-04-12"), xmax = as_date("2020-04-30"),
              ymin = -Inf, ymax = Inf, alpha = 0.010, fill = "yellow", color = NA) +
  geom_point(size = 1) +
  geom_errorbar(aes(x = Date,
                     ymin = CI90low,
                    ymax = CI90up),
                 size = 0.6,
                width = 0.8) +
  geom_linerange(aes(x = Date,
                     ymin = CI95low,
                    ymax = CI95up),
                 size = 0.4) +
  scale_color_manual(values = c("0.05 < p <= 0.10" = "orange", "p <= 0.05" = "red", "p > 0.10" = "gray")) +  
  geom_hline(yintercept = 0) +
  theme_bw() +
  ylab("rho") +
  theme(legend.position = "bottom")

common_legend <- g_legend(rho.plot)

# Plot without legend
rho.plot <- rho.plot +
   theme(legend.position = "none") 

#Intercept
intercept.plot <- plot_coeff(model = best_model, n = 1) + 
  ylab("Beta (Intercept)") +
  theme(legend.position = "none")

# Define layout for plot
lay <- rbind(c(1, 2),
             c(3, 3))

grid.arrange(rho.plot, intercept.plot, common_legend, 
             layout_matrix = lay,
             heights = c(5, 1))
```

Figure \ref{fig:beta-controls-time} shows the temporal evolution of the coefficients for the two control variables that were not fixed over time, i.e., $\log(Density)$ (left panel) and $Transit$ (right panel). 

In Section \ref{background} we had anticipated a positive sign for the coefficient of density, and indeed, at the beginning of the period the coefficient is positive, albeit not significant, and then remains mostly non-significant for the earlier part of the lockdown. We are somewhat surprised by the way this coefficient turns significant _and_ negative in the later part of the lockdown, after April 1st. This effect, we surmise, is the result of risk compensation, a situation where people adapt their behavior according to the *perceived* level of risk, becoming more careful when the perceived risk is higher and viceversa [e.g., @Noland1995perceived; @Richens2000risk; @Phillips2011risk]. Consequently, residents in high density regions may perceive the risk of infection as being higher, and adapt their behavior accordingly - while the opposite may be true of residents in low density regions. The coefficient for Transit is positive, as expected, but with very wide confidence intervals, and in fact not significant in the earlier part of the period.

```{r beta-controls-time, echo=FALSE,  fig.height= 4, fig.cap="\\label{fig:beta-controls-time}Temporal evolution of coefficient for the control variables; dots are the point estimates and vertical lines are 95% confidence intervals. In yellow is the period after the declaration of the state of emergency, and in orange is the period when only essential activities were allowed.", cache = TRUE}
#Density:
density.plot <- plot_coeff(model = best_model, n = 2) + 
  ylab("Beta (log[Density]))") +
  theme(legend.position = "none")

# Transit
transit.plot <- plot_coeff(model = best_model, n = 3) + 
  ylab("Beta (Transit)") +
  theme(legend.position = "none")

# Define layout for plot
lay <- rbind(c(1, 2),
             c(3, 3))

grid.arrange(density.plot, transit.plot, common_legend, 
             layout_matrix = lay,
             heights = c(5, 1))
```

The evolution of the coefficients for the three environmental variables is shown in Figure \ref{fig:beta-environmental-time}. Despite a mostly positive simple correlation with incidence (see Figure \ref{fig:daily-correlations}) once that we control for other factors, humidity has a negative association with incidence of COVID-19 in Spain (top-right panel). This is in line with the literature that describes the lower viability and transmission of different viruses at higher levels of humidity. The coefficients for temperature (top-right panel) are consistently negative and this variable is, besides the intercept, the only one with significant coefficients in all equations. The range of variation of this coefficient during the period examined is approximately between -1 and -2, although it is important to recall that these values should not be interpreted directly as effects; more on this below. Finally the plot for the coefficients associated with hours of sunshine (bottom panel) is more ambiguous: prior to the lockdown, the coefficient was negative, but not significant. However, five days into the lockdown, the coefficient becomes significant and *positive*. This result stands in contrast to previous findings regarding influenza, where more hours of sunlight reduced the strength and duration of epidemic durations [@Yu2013characterization]. A difference with previous studies is the temporal scale of the analysis: where Yu et al. [-@Yu2013characterization] use monthly averages, we use daily data for a much shorter period of time. The positive sign of sunshine may well be another instance of behavioral adaptations, whereby compliance with lockdown orders weakens on sunny days.

```{r beta-environmental-time, echo=FALSE, fig.height= 7, fig.cap="\\label{fig:beta-environmental-time}Temporal evolution of coefficient for the environmental variables; dots are the point estimates and vertical lines are 95% confidence intervals. In yellow is the period after the declaration of the state of emergency, and in orange is the period when only essential activities were allowed.", cache = TRUE}
# Humidity
humidity.plot <- plot_coeff(model = best_model, n = 4) + 
  ylab("Beta (log[Humidity])") +
  theme(legend.position = "none")

# Temperature
temperature.plot <- plot_coeff(model = best_model, n = 5) + 
  ylab("Beta (log[Temperature])") +
  theme(legend.position = "none")

# Sunshine
sunshine.plot <- plot_coeff(model = best_model, n = 6) + 
  ylab("Beta (log[Sunshine])") +
  theme(legend.position = "none")

# Define layout for plot
lay <- rbind(c(1, 1, 2, 2),
             c(1, 1, 2, 2),
             c(NA, 3, 3, NA),
             c(NA, 3, 3, NA),
             c(NA, 4, 4, NA))

grid.arrange(humidity.plot, temperature.plot, sunshine.plot, common_legend, 
             layout_matrix = lay,
             heights = c(4, 4, 4, 4, 1))

```

```{r impacts, evalcho=FALSE, include=FALSE}
Wsur <- as(spdep::listw2mat(listw), "CsparseMatrix")
trsur <- spatialreg::trW(Wsur, type = "MC")
impacts.best_model <- impactspsur(best_model, tr = trsur,R = 1000)

impacts.best_model[[1]]
impacts.best_model[[2]]
impacts.best_model[[3]]

myimpact <- matrix(0,ncol = 5, nrow = 29)
for (i in 2:30){
  myimpact[i-1,] <- impacts.best_model[[i]]$res$direct
}

matplot(myimpact, type='l')

```

```{r calculate-impacts-best-model, echo=FALSE}
# Calculate impacts
Wsur <- as(spdep::listw2mat(listw), "CsparseMatrix")
trsur <- spatialreg::trW(Wsur, type = "MC")
best_model.impacts <- impactspsur(best_model, tr = trsur,R = 1000)

# Organize in dataframe
best_model.impacts <- data.frame(Impact = unlist(best_model.impacts)) %>%
  # Identify point estimates of effects for each variable and simulated values 
  mutate(Variable = c(c(rep(c("log(GDPpc)", 
                              "log(Older)",
                              "log(Density)",
                              "Transit",
                              "log(Humidity)",
                              "log(Temperature)",
                              "log(Sunshine)"),
                            3),
                        rep("junk", 21000)),
                      rep(c(rep(c("log(Density)",
                                  "Transit",
                                  "log(Humidity)",
                                  "log(Temperature)",
                                  "log(Sunshine)"),
                                3),
                            rep("junk", 15000)), # These are the simulated values
                          29))) %>%
  filter(Variable != "junk") %>% # Remove all simulations, keep only point estimates
  # Add type of effect and date
  mutate(Type = c(rep(c("Direct", 
                        "Indirect", 
                        "Total"),
                      each = 7),
                  rep(rep(c("Direct", 
                            "Indirect", 
                            "Total"),
                          each = 5),
                      29)),
         Date = c(rep(ymd("2020-03-13"),
                      21),
                  rep(seq(ymd("2020-03-14"),
                          ymd("2020-04-11"), 
                          by = "days"), 
                      each = 15))) %>%
  # Convert to factors
  mutate(Variable = factor(Variable,
                           levels = c("log(GDPpc)", 
                              "log(Older)",
                              "log(Density)",
                              "Transit",
                              "log(Humidity)",
                              "log(Temperature)",
                              "log(Sunshine)")),
         Type = factor(Type,
                       levels = c("Direct", 
                            "Indirect", 
                            "Total")))
```

The preceding discussion helps to establish the inferential contributions of the analysis, and indicate which variables display significant statistical associations with incidence of COVID-19. The remaining question is, what are the implications. 

As discussed in Section \ref{methods} the effect of a variable is not clear from its coefficient alone, since a change to a variable in a province influences, via the contagion effect, its neighbors. For this reason, the appropriate way to estimate the effects is to calculate both the own effect and the effect due to contagion, or in other words the direct and indirect effects, respectively. The total effect is the sum of the two. A summary of the effects appears in Table \ref{tab:summary-impacts-best-model}. All effects in the table are interpreted as percentage change in the incidence of COVID-19 as a consequence of a one percent change in the variable. The exception to this is Transit (which was not log-transformed). This variable instead represents the percentage change in incidence between provinces without and with mass transit systems. 

Two variables had temporally constrained coefficients. The estimated effect of GDP per capita is to increase the incidence of COVID-19 by 0.449% for each percent increase of this variable (in €1,000s). In our view, this is a measure of inertia, as provinces with higher GDP per capita where among the first to see exponential growth in the pandemic. Percentage of older adults has a negative effect, and each percent increase in this variable is associated with a relatively small reduction of the incidence of approximately 0.67%.

The temporal variation of the effects for the rest of the variables is shown in Figure \ref{fig:plot-impacts}. The largest positive direct effect is Transit, and the largest direct negative effects are temperature and humidity. The direct effects of these variables are as follows: for each percent point increase in temperature, there is between a 1% and 2% reduction in the incidence of the disease. This effect is compounded via contagion, as seen in the central panel in the figure, and the indirect effect can further reduce the incidence by up to 0.75%. The effect of humidity is also to reduce the incidence: each percent point of increase in humidity is associated with a reduction of up to 2% in incidence. With the addition of the indirect effect, the total effect of a 1% increase in humidity is to reduce incidence by up to 3%. As seen in the figure, the indirect (i.e., contagion) effects are stronger at the before and at the beginning of the lockdown period. Nonetheless, by the end of the period under study, the indirect effects have weakened considerably.

What do these effects mean? Under a situation of lockdown, inter-regional contagion is reduced, as expected, and the total effects of the variables tend towards their direct effects. In the first few days covered by our analysis the total effect of all variables is greater due to the spatial contagion effect. Contagion makes analysis and intervention more complex: the contagion effect essentially acts like a multiplier, whereby developments in each province spill over to their neighbors. Once the contagion effect has been tamed, each province can be "treated" independently from its neighbors.

It is important, before concluding our discussion, to highlight some limitations of this study. 

First, the analysis was conducted mostly under a situation of lockdown, and therefore, besides first days of the period examined, one must exercise caution when trying to extrapolate the findings to a situations without lockdown. Secondly, it is well known that there is in many countries substantial underreporting of cases of COVID-19 due to limited testing. In this case we do not suspect systematic provincial bias in reporting, and as long as the underreporting is consistent across units of analysis, we do not expect biased results; it is still important, however, to remain aware that the number of true cases is likely higher. Finally, we defined neighborhoods based on adjacency. It would be interesting to compare adjacency to other connectivity criteria, for instance based on domestic transportation instrastructure and services. We flag this as a matter for future research.

```{r summary-impacts-best-model, echo=FALSE}
rbind(best_model.impacts %>%
  filter(Type == "Direct") %>%
  group_by(Variable) %>%
  summarize(Min = min(Impact), 
            Mean = mean(Impact), 
            Max = max(Impact)) %>% # Number of coefficients with pval <=5 
  mutate(Note = c("Constrained", 
                  "Constrained", 
                  "Non-constrained", 
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained")),
  best_model.impacts %>%
  filter(Type == "Indirect") %>%
  group_by(Variable) %>%
  summarize(Min = min(Impact), 
            Mean = mean(Impact), 
            Max = max(Impact)) %>% # Number of coefficients with pval <=5 
  mutate(Note = c("Constrained", 
                  "Constrained", 
                  "Non-constrained", 
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained")),
  best_model.impacts %>%
  filter(Type == "Total") %>%
  group_by(Variable) %>%
  summarize(Min = min(Impact), 
            Mean = mean(Impact), 
            Max = max(Impact)) %>% # Number of coefficients with pval <=5 
  mutate(Note = c("Constrained", 
                  "Constrained", 
                  "Non-constrained", 
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained",
                  "Non-constrained"))) %>%
  kable(#"latex",
        "html",
        booktabs = TRUE,
        digits = 3,
        col.names = c("Variable", "Min", "Mean", "Max", "Note"),
        caption = "\\label{tab:summary-impacts-best-model}Summary of direct, indirect, and total effects according to best model (lag8: lagged 8-day moving average)") %>%
  #kable_styling(latex_options = c("striped")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed")) %>%
  pack_rows("Direct Effects", 1, 7) %>%
  pack_rows("Indirect Effects", 8, 14) %>%
  pack_rows("Total Effects", 15, 21) %>%
  footnote(general = c("Non-constrained: coefficient was allowed to vary across equations",
                       "Constrained: coefficient as constant across equations"))
```

```{r plot-impacts, echo=FALSE, fig.cap="\\label{fig:plot-impacts}Temporal evolution of direct, indirect, and total effects by date."}
best_model.impacts %>%
  filter(Variable != "log(GDPpc)", 
         Variable != "log(Older)") %>%
  ggplot(aes(x = Date, 
             y = Impact, 
             color = Variable)) +
  geom_line(size = 0.8) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90),
        legend.position = "bottom") +
  facet_wrap(~ Type, ncol = 3)

```

Concluding Remarks {#conclusion}
===================

In this paper we presented a spatio-temporal analysis of incidence of COVID-19 in Spain. The analysis covers a 30-day period that begins immediately before the state of emergency was declared in the country. The focus of the research has been on the environmental correlates of incidence of the disease. There is a need for more empirical evidence, as policy makers, public health practitioners, and the public begin planning for the months ahead at this early stage of the pandemic.

Our results offer strong support for the hypothesis that incidence of COVID-19 at the population level is lower at higher temperatures and levels of humidity: the estimated effect is a reduction in the neighborhood of 3% percent in incidence per each 1% increase in temperature, and a 3% reduction in incidence per 1% increase in humidity *under conditions of contagion*. These reductions are lower when contagion has ceased (i.e., due to lockdown conditions). The question here seems to be whether these environmental variables can yield a bigger reduction of *more* cases, or a smaller reduction of *fewer* cases. 

Our control variables also offer some interesting insights. In particular, there is evidence of behavioral adaptations during lockdown in the form of risk compensation (density) and compliance with the lockdown (sunshine). These results offer a cautionary tale with regards to the effectiveness of the lockdown in more dense areas, and also the implications for compliance with stay-at-home orders as the northern hemisphere moves towards summer and more hours of sunshine during the day. If risk compensation is a factor, then efforts should be made to reduce or eliminate risk compensation in less densely populated regions.

A key aspect of the analysis using spatial SUR models is that we were able to model incidence of COVID-19 as an interregional contagion process. Here, we find that the strength of the contagion effect was dramatically reduced by the lockdown.

Needless to say, the analysis presented here is at the level of population health. For this reason, the analysis does not make any claims with respect to the effect of ultraviolet light on the virus, but rather about transmission of the virus in the population. For example, the analysis does not imply that the virus moves less effectively in places where more people live in close proximity to each other, but rather that humans are more contagious when they feel safe in less dense regions. Similarly, more sunshine does not mean that the virus thrives, but rather that humans are more contagious to each other when their behavior adapts to this environmental condition.

Some directions for future research include investigating other modelling frameworks, such as geographically and temporally weighted regression and/or space-time conditional autoregressive models. In addition, the environmental variables examined here relate to meteorological conditions only, and did not include other environmental factors that may incide in the transmission of the virus, such as pollution. These other factors should be incorporated in future studies.

Acknowledgments {#acknowledgments .unnumbered}
==========

Fernando A. López is supported by the program Groups of Excellence of the Region of Murcia, Fundación Séneca, Science and Technology Agency of the region of Murcia project 19884/GERM/15. Tatiane Menezes, Renata Cavalcanti, and Maira Galdino da Rocha Pitta are supported by Brazil's Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Fundação Capes). Antonio Páez is not supported by Canada's Research Councils. The following `R` packages were used in the course of this investigation and the authors wish to acknowledge their developers: `aemet` [@Rodriguez2018aemet], `ggthemes` [@Arnold2019], `gridExtra` [@Auguie2017gridextra], `kableExtra` [@Zhu2019], `knitr` [@Xie2014; @Xie2015], `lubridate` [@Grolemund2011dates], `plm` [@Millo2017robust], `rticles` [@Allaire2020], `sf` [@Pebesma2018], `spdep` [@Bivand2013], spsur [@Angulo2020spsur] `tidyverse` [@Wickham2019], `units` [@Pebesma2016].
  
References {#references .unnumbered}
==========