```{r analysis-setup, include=FALSE}
library(ggplot2)
source("r/render.R")
knitr::opts_chunk$set(eval = FALSE)
```
# Analysis {#analysis}
> First lesson: stick them with the pointy end.
>
> --- Jon Snow
Previous chapters focused on introducing Spark with R, getting you up to speed and encouraging you to try basic data analysis workflows. However, they did not properly introduce what data analysis means, especially with Spark. They presented the tools you will need throughout this book—tools that will help you spend more time learning and less time troubleshooting.
This<!--((("Spark using R", "data analysis", id="SURdataanal03")))--> chapter introduces tools and concepts to perform data analysis in Spark from R. Spoiler alert: these are the same tools you use with plain R! This is not a mere coincidence; rather, we want data scientists to live in a world where technology is hidden from them, where you can use the R packages you know and love, and they "just work" in Spark! Now, we are not quite there yet, but we are also not that far. Therefore, in this chapter you learn widely used R packages and practices to perform data analysis—`dplyr`, `ggplot2`, formulas, `rmarkdown`, and so on—which also happen to work in Spark.
[Chapter 4](#modeling) will focus on creating statistical models to predict, estimate, and describe datasets, but first, let’s get started with analysis!
## Overview {#analysis-overview}
In<!--((("data analysis", "overview of")))--> a data analysis project, the main goal is to understand what the data is trying to "tell us", hoping that it provides an answer to a specific question. Most data analysis projects follow a set of steps, as shown in Figure \@ref(fig:analysis-steps).
As the diagram illustrates, we first _import_ data into our analysis system, where we _wrangle_ it by trying different data transformations, such as aggregations. We then _visualize_ the data to help us perceive relationships and trends. To gain deeper insight, we can fit one or more statistical _models_ against sample data, which helps us find out whether the patterns hold when new data is applied. Lastly, the results are communicated publicly or privately to colleagues and stakeholders.
```{r analysis-steps, echo=FALSE, out.width='auto', out.height='220pt', fig.cap='The general steps of a data analysis', fig.align = 'center', eval = TRUE}
render_nomnoml("
#direction: right
#edgeMargin: 4
#padding: 25
#fontSize: 18
[Import] -> [Understand]
[Understand |
[Wrangle] -> [Visualize]
[Visualize] -> [Model]
[Model] -> [Wrangle]
]
[Understand] -> [Communicate]
", "images/analysis-overview-diagram.png")
```
When working with not-large-scale datasets—as in datasets that fit in memory—we can perform all those steps from R, without using Spark. However, when data does not fit in memory or computation is simply too slow, we can slightly modify this approach by incorporating Spark. But how?
For data analysis, the ideal approach is to let Spark do what it's good at. Spark is a<!--((("parallel execution")))--> parallel computation engine that works at a large scale and provides a SQL engine and modeling libraries. You can use these to perform most of the same operations R performs, including data selection, transformation, and modeling. Additionally, Spark includes tools for specialized computational work like graph analysis, stream processing, and many others. For now, we will set those specialized workloads aside and present them in later chapters.
You can perform data _import_, _wrangling_, and _modeling_ within Spark. You can also partly do _visualization_ with Spark, which we cover later in this chapter. The idea is to use R to tell Spark what data operations to run, and then only bring the results into R. As<!--((("data analysis", "push compute, collect results principle")))--> illustrated in Figure \@ref(fig:analysis-approach), the ideal method _pushes compute_ to the Spark cluster and then _collects results_ into R.
```{r analysis-approach, echo=FALSE, out.width='auto', out.height='220pt', fig.cap='Spark computes while R collects results', fig.align = 'center', eval = TRUE}
render_nomnoml("
#direction: right
#arrowSize: 0.4
#lineWidth: 1
[R|
[<note>sparklyr
dbi
dplyr
corrr
ggplot2
dbplot
...
rmarkdown]]->[<label> Push Compute]
[Push Compute]->[Spark|[<note>Spark SQL
Machine Learning
Graph Analysis
Spark Streaming
...]]
[R]<-[<label> Collect Results]
[Collect Results]<-[Spark]
", "images/analysis-r-interface-to-spark-sql.png")
```
The `sparklyr` package<!--((("sparklyr package", "push compute, collect results principle")))--> aids in using the "push compute, collect results" principle. Most of its functions are wrappers on top of Spark API calls. This allows us to take advantage of Spark’s analysis components, instead of R’s. For example, when you need to fit a linear regression model, instead of using R’s familiar `lm()` function, you would use Spark’s `ml_linear_regression()` function. This R function then calls Spark to create this model. Figure \@ref(fig:analysis-scala) depicts this specific example.
```{r analysis-scala, echo=FALSE, out.width='auto', out.height='140pt', fig.cap='R functions call Spark functionality', fig.align = 'center', eval = TRUE}
render_nomnoml("
#direction: right
#padding: 12
#lineWidth: 1
#arrowSize: 0.4
[R | [<note>ml_linear_regression()]]->[<label>Invokes]
[Invokes]->[Scala | [<note>var lr = new LinearRegression()
var lrModel = lr.fit(training)]]
", "images/analysis-r-interface-to-spark-scala.png")
```
For<!--((("dplyr package", "sparklyr backend for")))--> more common data manipulation tasks, `sparklyr` provides a backend for `dplyr`. This<!--((("DataFrames", "using dplyr verbs with")))--> means you can use `dplyr` verbs with which you're already familiar in R, and then `sparklyr` and `dplyr` will translate those actions into Spark SQL statements, which are generally more compact and easier to read than SQL statements (see Figure \@ref(fig:analysis-sql)). So, if you are already familiar with R and `dplyr`, there is nothing new to learn. This might feel a bit anticlimactic—indeed, it is—but it’s also great since you can focus that energy on learning other skills required to do large-scale computing.
```{r analysis-sql, echo=FALSE, out.width='auto', out.height='140pt', fig.cap='dplyr writes SQL in Spark', fig.align = 'center', eval = TRUE}
render_nomnoml("
#direction: right
#padding: 12
#lineWidth: 1
#arrowSize: 0.4
[R | [<note>filter(data, y == 1)]]->[<label>Invokes]
[Invokes]-> [SQL | [<note>... WHERE y = 1 ...]]
", "images/analysis-dplyr-sql-translation.png")
```
To practice as you learn, the rest of this chapter’s code uses a single exercise that runs in the _local_ Spark master. This way, you can replicate the code on your personal computer. Make sure `sparklyr` is already working, which should be the case if you completed [Chapter 2](#starting).
This chapter will make use of packages that you might not have installed. So, first, make sure the following packages<!--((("ggplot2 package", "installing")))((("corr package", "installing")))((("dbplot package", "installing")))((("R Markdown")))--> are installed by running these commands:
```{r eval=FALSE, exercise=TRUE}
install.packages("ggplot2")
install.packages("corrr")
install.packages("dbplot")
install.packages("rmarkdown")
```
First, load the `sparklyr` and `dplyr` packages and then open a new _local_ connection.
```{r}
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3")
```
The environment is ready to be used, so our next task is to import data that we can later analyze.
## Import
When<!--((("data analysis", "importing data")))--> using Spark with R, you need to approach importing data differently. Usually, importing means that R will read files and load them into memory; when you are using Spark, the data is imported into Spark, not R. In Figure \@ref(fig:analysis-access), notice how the data source is connected to Spark instead of being connected to R.
```{r analysis-access, echo=FALSE, out.width='auto', out.height='160pt', fig.cap='Import data to Spark not R', fig.align = 'center', eval = TRUE}
render_nomnoml("
#direction: right
#padding: 14
#lineWidth: 1
#arrowSize: 0.4
[R]->[<label>Push Compute]
[Push Compute]->[Spark]
[R]<-[<label>Collect Results]
[Collect Results]<-[Spark]
[Spark]<-[<label> Import]
[Import]<-[Data Source]
", "images/analysis-import-data-to-spark.png")
```
**Note:** When you're performing analysis over large-scale datasets, the vast majority of the necessary data will already be available in your Spark cluster (which is usually made available to users via Hive tables or by accessing the file system directly). [Chapter 8](#data) will cover this extensively.
Rather than importing all of the data into Spark, you can ask Spark to access the data source without importing it—this is a decision you should make based on speed and performance. Importing all of the data into the Spark session incurs a one-time up-front cost, since Spark needs to wait for the data to be loaded before analyzing it. If the data is not imported, you usually incur a cost with every Spark operation, since Spark needs to retrieve a subset from the cluster's storage, usually disk drives, which are much slower to read from than Spark's memory. More on this topic is covered in [Chapter 9](#tuning).
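To make the trade-off concrete, here is a minimal sketch (the CSV path is hypothetical) showing both options with `spark_read_csv()`; reading files into Spark is covered properly in [Chapter 8](#data):
```{r analysis-read-csv-sketch, eval=FALSE}
# Hypothetical file: "data/cars.csv" does not ship with this book.
# memory = TRUE (the default) loads the data into Spark memory up front;
# memory = FALSE only maps the file, so each query reads it from storage.
cars_file <- spark_read_csv(sc, name = "cars_file",
                            path = "data/cars.csv", memory = FALSE)
```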
Let’s prime<!--((("commands", "copy_to()")))--> the session with some data by importing `mtcars` into Spark using `copy_to()`; you can also import data from distributed files in many different file formats, which we look at in [Chapter 8](#data).
```{r analysis-copy-to}
cars <- copy_to(sc, mtcars)
```
**Note:** When using real clusters, you should use `copy_to()` to transfer only small tables from R; large data transfers should be performed with specialized data transfer tools.
The data is now accessible to Spark and you can now apply transformations with ease; the next section covers how to wrangle data by running transformations inside Spark, using `dplyr`.
## Wrangle
Data wrangling<!--((("data analysis", "wrangling data", id="DAwrangl03")))((("data wrangling", id="dwrang03")))((("transformations", "data wrangling")))--> uses transformations to understand the data. It is often described as the process of transforming data from its "raw" form into a format that is more appropriate for data analysis.
Malformed<!--((("missing values")))--> or missing values and columns with multiple attributes are common data problems you might need to fix, since they prevent you from understanding your dataset. For example, a "name" field contains the last and first name of a customer. There are two attributes (first and last name) in a single column. To be usable, we need to _transform_ the "name" field, by _changing_ it into "first_name" and "last_name" fields.
After the data is cleaned, you still need to understand the basics of its content. Other transformations, such as aggregations, can help with this task. For example, requesting the average balance of all customers returns a single row and column whose value is that average; this gives us context when we later look at individual or grouped customer balances.
The main goal is to write the data transformations using R syntax as much as possible. This saves us from the cognitive cost of having to switch between multiple computer technologies to accomplish a single task. In this case, it is better to take advantage<!--((("dplyr package", "SQL statements using")))--> of `dplyr` instead of writing Spark SQL statements for data exploration.
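That said, a `sparklyr` connection also implements the DBI interface, so raw SQL remains available as an escape hatch when it is genuinely more convenient; the query below is only a sketch against the `mtcars` table copied earlier:
```{r analysis-dbi-sketch, eval=FALSE}
# Run a SQL statement directly; the result is collected into an R data frame.
library(DBI)
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl")
```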
In the R environment, `cars` can be treated as if it were a local<!--((("DataFrames", "using dplyr verbs with")))--> DataFrame, so you can use `dplyr` verbs. For instance, we can find out the mean of all columns by using `summarise_all()`:
```{r analysis-summarize-all}
summarize_all(cars, mean)
```
```
# Source: spark<?> [?? x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20.1 6.19 231. 147. 3.60 3.22 17.8 0.438 0.406 3.69 2.81
```
While this code is exactly the same as the code you would run when using `dplyr` without Spark, a lot is happening under the hood. The data is _not_ being imported into R; instead, `dplyr` converts this task into SQL statements that are then sent to Spark. The<!--((("commands", "show_query()")))--> `show_query()` command makes it possible to peer into the SQL statement that `sparklyr` and `dplyr` created and sent to Spark. We<!--((("pipe (%>%) operator")))--> can also use this time to introduce the pipe operator (`%>%`), a custom operator from<!--((("magrittr package")))--> the `magrittr` package that pipes a computation into the first argument of the next function, making your data analysis much easier to read:
```{r analysis-summarize-show-query}
summarize_all(cars, mean) %>%
show_query()
```
```
<SQL>
SELECT AVG(`mpg`) AS `mpg`, AVG(`cyl`) AS `cyl`, AVG(`disp`) AS `disp`,
AVG(`hp`) AS `hp`, AVG(`drat`) AS `drat`, AVG(`wt`) AS `wt`,
AVG(`qsec`) AS `qsec`, AVG(`vs`) AS `vs`, AVG(`am`) AS `am`,
AVG(`gear`) AS `gear`, AVG(`carb`) AS `carb`
FROM `mtcars`
```
As is evident, `dplyr` is much more concise than SQL, but rest assured, you will not need to see or understand SQL when using `dplyr`. Your focus can remain on obtaining insights from the data, as opposed to figuring out how to express a given set of transformations in SQL. Here is another example that groups the `cars` dataset by `transmission` type:
```{r analysis-group-summarize}
cars %>%
mutate(transmission = ifelse(am == 0, "automatic", "manual")) %>%
group_by(transmission) %>%
summarise_all(mean)
```
```
# Source: spark<?> [?? x 12]
transmission mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 automatic 17.1 6.95 290. 160. 3.29 3.77 18.2 0.368 0 3.21 2.74
2 manual 24.4 5.08 144. 127. 4.05 2.41 17.4 0.538 1 4.38 2.92
```
Most of the<!--((("dplyr package", "data transformations")))--> data transformation operations made available by `dplyr` to work with local DataFrames are also available to use with a Spark connection. This means that you can focus on learning `dplyr` first and then reuse that skill when working with Spark. Chapter 5 from the book [_R for Data Science_](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund (O'Reilly) is a great resource to learn `dplyr` in depth. If proficiency with `dplyr` is not an issue for you, we recommend that you take some time to experiment with different `dplyr` functions against the `cars` table.
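For instance, here is a small sketch (the cutoff values are arbitrary) chaining a few verbs against the remote `cars` table:
```{r analysis-dplyr-verbs-sketch, eval=FALSE}
# Select a few columns, keep the heavier cars, and sort by fuel efficiency;
# every step is translated to Spark SQL and executed remotely.
cars %>%
  select(mpg, cyl, wt) %>%
  filter(wt > 3) %>%
  arrange(desc(mpg))
```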
Sometimes, we might need to perform an operation not yet available through `dplyr` and `sparklyr`. Instead of downloading the data into R, there is usually a Hive function within Spark to accomplish what we need. The next section covers this scenario.
### Built-in Functions
Spark SQL is based on<!--((("Hive project", "SQL conventions")))--> Hive’s SQL conventions and functions, and it is possible to call all these functions using `dplyr` as well. This means that we can use any Spark SQL functions to accomplish operations that might not be available via `dplyr`. We can access the functions by calling them as if they were R functions. Instead of failing, `dplyr` passes functions it does not recognize as is to the query engine. This gives us a lot of flexibility on the functions we can use.
For instance, the `percentile()` function returns the exact percentile of a column in a group. The function expects a column name, and either a single percentile value or an array of percentile values. We can use this Spark SQL function from `dplyr`, as follows:
```{r analysis-percentile}
summarise(cars, mpg_percentile = percentile(mpg, 0.25))
```
```
# Source: spark<?> [?? x 1]
mpg_percentile
<dbl>
1 15.4
```
There is no `percentile()` function in R, so `dplyr` passes that portion of the code as-is to the resulting SQL query:
```{r analysis-percentile-query}
summarise(cars, mpg_percentile = percentile(mpg, 0.25)) %>%
show_query()
```
```
<SQL>
SELECT percentile(`mpg`, 0.25) AS `mpg_percentile`
FROM `mtcars`
```
To pass multiple values to `percentile()`, we can call another Hive function called `array()`. In this case, `array()` works similarly to R's `list()` function: we can pass multiple values separated by commas. The output from Spark is an array variable, which is imported into R as a list column:
```{r analysis-percentile-rename}
summarise(cars, mpg_percentile = percentile(mpg, array(0.25, 0.5, 0.75)))
```
```
# Source: spark<?> [?? x 1]
mpg_percentile
<list>
1 <list [3]>
```
You can use the `explode()` function to separate Spark’s array value results into their own record. To do this, use `explode()` within<!--((("commands", "mutate()")))--> a `mutate()` command, and pass the variable containing the results of the percentile operation:
```{r analysis-percentile-mutate}
summarise(cars, mpg_percentile = percentile(mpg, array(0.25, 0.5, 0.75))) %>%
mutate(mpg_percentile = explode(mpg_percentile))
```
```
# Source: spark<?> [?? x 1]
mpg_percentile
<dbl>
1 15.4
2 19.2
3 22.8
```
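The same pass-through mechanism works for any other Hive or Spark SQL function. For instance, here is a sketch using the `covar_samp()` aggregate, which has no `dplyr` translation and is therefore sent to Spark verbatim:
```{r analysis-covar-sketch, eval=FALSE}
# covar_samp() is a Spark SQL aggregate, not an R function;
# dplyr passes it as-is into the generated query.
summarise(cars, mpg_wt_cov = covar_samp(mpg, wt))
```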
We have included a comprehensive list of all the Hive functions in the section [Hive Functions](#hive-functions). Glance over them to get a sense of the wide range of operations that you can accomplish with them.<!--((("", startref="dwrang03")))-->
### Correlations
A<!--((("correlations, calculating")))--> very common exploration technique is to calculate and visualize correlations, which we often calculate to find out what kind of statistical relationship exists between paired sets of variables. Spark provides functions to<!--((("DataFrames", "correlations and")))--> calculate correlations across the entire dataset and returns the results to R as a DataFrame object:
```{r analysis-corrr-cars}
ml_corr(cars)
```
```
# A tibble: 11 x 11
mpg cyl disp hp drat wt qsec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.852 -0.848 -0.776 0.681 -0.868 0.419
2 -0.852 1 0.902 0.832 -0.700 0.782 -0.591
3 -0.848 0.902 1 0.791 -0.710 0.888 -0.434
4 -0.776 0.832 0.791 1 -0.449 0.659 -0.708
5 0.681 -0.700 -0.710 -0.449 1 -0.712 0.0912
6 -0.868 0.782 0.888 0.659 -0.712 1 -0.175
7 0.419 -0.591 -0.434 -0.708 0.0912 -0.175 1
8 0.664 -0.811 -0.710 -0.723 0.440 -0.555 0.745
9 0.600 -0.523 -0.591 -0.243 0.713 -0.692 -0.230
10 0.480 -0.493 -0.556 -0.126 0.700 -0.583 -0.213
11 -0.551 0.527 0.395 0.750 -0.0908 0.428 -0.656
# ... with 4 more variables: vs <dbl>, am <dbl>,
# gear <dbl>, carb <dbl>
```
The `corrr` R package<!--((("corr package", "benefits of")))--> specializes in correlations. It contains friendly functions to prepare and visualize the results. Included inside the package is a backend for Spark, so when a Spark object is used in `corrr`, the actual computation also happens in Spark. In the background, the<!--((("commands", "correlate()")))--> `correlate()` function runs `sparklyr::ml_corr()`, so there is no need to collect any data into R prior to running the command:
```{r analysis-corrr-pairwise}
library(corrr)
correlate(cars, use = "pairwise.complete.obs", method = "pearson")
```
```
# A tibble: 11 x 12
rowname mpg cyl disp hp drat wt
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg NA -0.852 -0.848 -0.776 0.681 -0.868
2 cyl -0.852 NA 0.902 0.832 -0.700 0.782
3 disp -0.848 0.902 NA 0.791 -0.710 0.888
4 hp -0.776 0.832 0.791 NA -0.449 0.659
5 drat 0.681 -0.700 -0.710 -0.449 NA -0.712
6 wt -0.868 0.782 0.888 0.659 -0.712 NA
7 qsec 0.419 -0.591 -0.434 -0.708 0.0912 -0.175
8 vs 0.664 -0.811 -0.710 -0.723 0.440 -0.555
9 am 0.600 -0.523 -0.591 -0.243 0.713 -0.692
10 gear 0.480 -0.493 -0.556 -0.126 0.700 -0.583
11 carb -0.551 0.527 0.395 0.750 -0.0908 0.428
# ... with 5 more variables: qsec <dbl>, vs <dbl>,
# am <dbl>, gear <dbl>, carb <dbl>
```
We can pipe the results to other `corrr` functions. For example, the `shave()` function turns all of the duplicated results into `NAs`. Again, while this feels like standard R code using existing R packages, Spark is being used under the hood to perform the correlation.
Additionally, the results can be easily visualized using<!--((("commands", "rplot()")))--> the `rplot()` function, as shown in Figure \@ref(fig:analysis-corrr-rplot):
```{r analysis-corrr-show}
correlate(cars, use = "pairwise.complete.obs", method = "pearson") %>%
shave() %>%
rplot()
```
```{r analysis-corrr-rplot-code, echo=FALSE}
correlate(cars, use = "pairwise.complete.obs", method = "pearson") %>%
shave() %>%
rplot(colours = c("#333333", "#999999")) +
ggsave("images/analysis-corrr-rplot.png", width = 10, height = 5)
```
```{r analysis-corrr-rplot, eval = TRUE, out.width='800pt', out.height='400pt', fig.align = 'center', fig.cap = 'Using rplot() to visualize correlations', echo = FALSE}
render_image("images/analysis-corrr-rplot.png")
```
It is now much easier to see which relationships are positive or negative: positive relationships are shown in gray and negative relationships in black. The size of the circle indicates how strong the relationship is. The power of visualizing data lies in how much easier it makes it for us to understand the results. The next section expands on this step of the process.<!--((("", startref="DAwrangl03")))-->
## Visualize
Visualizations<!--((("visualizations", "applications for")))((("data analysis", "visualizing data", id="DAvisual03")))--> are a vital tool to help us find patterns in the data. It is easier for us to identify outliers in a dataset of 1,000 observations when plotted in a graph, as opposed to reading them from a list.
R<!--((("visualizations", "using R")))((("R computing language", "visualizations using")))--> is great at data visualizations. Its capabilities for creating plots are extended by the many R packages that focus on this analysis step. Unfortunately, the vast majority of R functions that create plots depend on the data already being in local memory within R, so they fail when using a remote table within Spark.
It is possible to create visualizations in R from data sources that exist in Spark. To understand how to do this, let’s first break down how computer programs build plots. To begin, a program takes the raw data and performs some sort of transformation. The transformed data is then mapped to a set of coordinates. Finally, the mapped values are drawn in a plot. Figure \@ref(fig:analysis-plot) summarizes each of the steps.
```{r analysis-plot, echo=FALSE, out.width='auto', out.height='80pt', fig.cap='Stages of an R plot', fig.align = 'center', eval = TRUE}
render_nomnoml("
#direction: right
#edgeMargin: 4
#padding: 25
#fontSize: 18
#lineWidth: 1
[Raw data] -> [Transformed Data]
[Transformed Data] -> [Map data to coordinates]
[Map data to coordinates] -> [Draw plot]
", "images/analysis-stages-of-a-plot.png")
```
In essence, the approach for visualizing is the same as in wrangling: push the computation to Spark, and then collect the results in R for plotting. As illustrated in Figure \@ref(fig:analysis-spark-plot), the heavy lifting of preparing the data, such as aggregating the data by groups or bins, can be done within Spark, and then the much smaller dataset can be collected into R. Inside R, the plot becomes a more basic operation. For example, for a histogram, the bins are calculated in Spark, and then plotted in R using a simple column plot, as opposed to a histogram plot, because there is no need for R to recalculate the bins.
```{r analysis-spark-plot, echo=FALSE, out.width='auto', out.height='150pt', fig.cap='Plotting with Spark and R', fig.align = 'center', eval = TRUE}
render_nomnoml("
#direction: right
#edgeMargin: 4
#padding: 25
#fontSize: 18
[Spark |
[Raw data] -> [Transformed Data]
]
[R |
[Map to coordinates] -> [Draw plot]
]
[Spark] -> [R]
", "images/analysis-plotting-with-spark-and-r.png")
```
Let’s apply this conceptual model when using `ggplot2`.
### Using ggplot2
To<!--((("visualizations", "using ggplot2")))((("ggplot2 package", "visualizations using")))--> create a bar plot using `ggplot2`, we simply call a function:
```{r analysis-ggplot2-simple, eval = TRUE, out.width='500pt', out.height='400pt', fig.cap='Plotting inside R', fig.align = 'center'}
library(ggplot2)
ggplot(aes(as.factor(cyl), mpg), data = mtcars) + geom_col()
```
In this case, the `mtcars` raw data was _automatically_ transformed into three discrete aggregated numbers. Next, each result was mapped into an `x` and `y` plane. Then the plot was drawn. As R users, all of the stages of building the plot are conveniently abstracted for us.
In Spark, there<!--((("Apache Spark", "push compute, collect results principle")))--> are a couple of key steps when codifying the "push compute, collect results" approach. First, ensure that the transformation operations happen within Spark. In<!--((("commands", "group_by()")))((("commands", "summarise()")))((("commands", "collect()")))--> the example that follows, `group_by()` and `summarise()` run inside Spark. Second, bring the results back into R after the data has been transformed. Be sure to transform and then collect, in that order; if `collect()` is run first, R will try to ingest the entire dataset from Spark, and depending on the size of the data, collecting all of it will slow down or can even bring down your system.
```{r analysis-ggplot2-summarise}
car_group <- cars %>%
group_by(cyl) %>%
summarise(mpg = sum(mpg, na.rm = TRUE)) %>%
collect() %>%
print()
```
```
# A tibble: 3 x 2
cyl mpg
<dbl> <dbl>
1 6 138.
2 4 293.
3 8 211.
```
In this example, now that the data has been preaggregated and collected into R, only three records are passed to the plotting function:
```{r analysis-ggplot2-groups}
ggplot(aes(as.factor(cyl), mpg), data = car_group) +
geom_col(fill = "#999999") + coord_flip()
```
Figure \@ref(fig:analysis-viz1) shows the resulting plot.
```{r, eval = FALSE, echo = FALSE}
ggplot(aes(as.factor(cyl), mpg), data = car_group) +
geom_col(fill = "#878787") +
coord_flip() +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
ggsave("images/analysis-visualizations-1.png", width = 10, height = 3)
```
```{r analysis-viz1, eval = TRUE, fig.width=10, fig.height=5, fig.cap='Plot with aggregation in Spark', fig.align = 'center', echo = FALSE}
render_image("images/analysis-visualizations-1.png")
```
Any other `ggplot2` visualization can be made to work using this approach; however, this is beyond the scope of the book. Instead, we recommend that you read [_R Graphics Cookbook_](https://oreil.ly/bIF4), by Winston Chang (O'Reilly) to learn additional visualization techniques applicable to Spark. Now, to ease this transformation step before visualizing, the `dbplot` package provides a few ready-to-use visualizations that automate aggregation in Spark.
### Using dbplot
The `dbplot` package<!--((("visualizations", "using dbplot")))((("dbplot package", "visualizations using")))--> provides helper functions for plotting with remote data. The R code that `dbplot` uses to transform the data is written so that it can be translated into Spark. It then uses those results to create a graph with the `ggplot2` package, so data transformation and plotting are both triggered by a single function.
The `dbplot_histogram()` function<!--((("commands", "dbplot_histogram()")))((("histograms")))--> makes Spark calculate the bins and the count per bin and outputs a `ggplot` object, which we can further refine by adding more steps to the plot object. `dbplot_histogram()` also accepts a `binwidth` argument to control the range used to compute the bins:
```{r analysis-dbplot-simple}
library(dbplot)
cars %>%
dbplot_histogram(mpg, binwidth = 3) +
labs(title = "MPG Distribution",
subtitle = "Histogram over miles per gallon")
```
Figure \@ref(fig:analysis-visualizations-histogram) presents the resulting plot.
```{r analysis-dbplot-simple-plot, eval = FALSE, echo = FALSE}
db_compute_bins(cars, mpg, binwidth = 3) %>%
ggplot() +
geom_col(aes(mpg, count), fill = "#878787") +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
labs(title = "MPG Distribution", subtitle = "Histogram over miles per gallon") +
ggsave("images/analysis-visualizations-histogram.png", width = 10, height = 5)
```
```{r analysis-visualizations-histogram, eval = TRUE, fig.width=10, fig.height=5, fig.cap='Histogram created by dbplot', fig.align = 'center', echo = FALSE}
render_image("images/analysis-visualizations-histogram.png")
```
Histograms provide a great way to analyze a single variable. To analyze two variables, a<!--((("scatter plots")))((("raster plots")))--> scatter or raster plot is commonly used.
Scatter plots are used to compare the relationship between two continuous variables. For example, a scatter plot can display the relationship between the weight of a car and its gas consumption. The plot in Figure \@ref(fig:analysis-point) shows that the higher the weight, the higher the gas consumption: the dots clump together into almost a line running from the upper left toward the lower right. Here's the code to generate the plot:
```{r analysis-scatter-code, eval = FALSE}
ggplot(aes(mpg, wt), data = mtcars) +
geom_point()
```
```{r analysis-scatter-render, eval = FALSE, echo = FALSE}
ggplot(aes(mpg, wt), data = mtcars) +
geom_point() +
labs(title = "Weigth over MPG", subtitle = "A scatter plot visualizing car weigth over miles per gallon") +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
scale_y_continuous(breaks = 0:5) +
ggsave("images/analysis-point.png", width = 10, height = 5)
```
```{r analysis-point, eval = TRUE, fig.width=10, fig.height=5, fig.cap='Scatter plot example in Spark', fig.align = 'center', echo = FALSE}
render_image("images/analysis-point.png")
```
However, for scatter plots over large datasets, no amount of "pushing the computation" to Spark will help, because every individual point must be collected into R and drawn.
The best alternative is to find a plot type that represents the x/y relationship and concentration in a way that is easy to perceive and practical to plot. The _raster_ plot might be the best answer: it returns a grid of x/y positions and the result of a given aggregation, usually represented by the color of each square.
You<!--((("commands", "dbplot_raster()")))--> can use `dbplot_raster()` to create a scatter-like plot in Spark, while only retrieving (collecting) a small subset of the remote dataset:
```{r analysis-raster-code, eval = FALSE}
dbplot_raster(cars, mpg, wt, resolution = 16)
```
```{r analysis-raster-render, eval = FALSE, echo = FALSE}
db_compute_raster2(cars, mpg, wt, resolution = 16) %>%
ggplot() +
scale_fill_gradientn(colours = grey.colors(10, start = 0.6, end = 0.2)) +
geom_raster(aes(mpg, wt, fill = `n()`)) +
scale_y_continuous(breaks = 0:5) +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
labs(title = "Weigth over MPG", subtitle = "A raster plot visualizing car weigth over miles per gallon using Spark") +
ggsave("images/analysis-visualizations-raster.png", width = 10, height = 5)
```
```{r analysis-visualizations-raster, eval = TRUE, fig.width=10, fig.height=5, fig.cap='A raster plot using Spark', fig.align = 'center', echo = FALSE}
render_image("images/analysis-visualizations-raster.png")
```
As shown in Figure \@ref(fig:analysis-visualizations-raster), the resulting plot returns a grid no bigger than 16 x 16. This limits the number of records that need to be collected into R to 256.
**Tip:** You<!--((("commands", "db_compute_bins()")))((("commands", "db_compute_boxplot()")))((("commands", "db_compute_raster()")))((("commands", "db_compute_count()")))--> can also use `dbplot` to retrieve the raw data and visualize by other means; to retrieve the aggregates, but not the plots, use `db_compute_bins()`, `db_compute_count()`, `db_compute_raster()`, and `db_compute_boxplot()`.
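For example, here is a minimal sketch (the bin width is arbitrary) that retrieves only the bin aggregates and leaves the plotting entirely up to you:
```{r analysis-compute-bins-sketch, eval=FALSE}
# Compute histogram bins in Spark, collect only the small summary,
# and then draw it with whichever plotting approach you prefer.
mpg_bins <- db_compute_bins(cars, mpg, binwidth = 3)
ggplot(mpg_bins) +
  geom_col(aes(mpg, count))
```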
While visualizations are indispensable, you can complement data analysis with statistical models to gain even deeper insights into your data. The next section describes how to prepare data for modeling with Spark.<!--((("", startref="DAvisual03")))-->
## Model
The<!--((("data analysis", "modeling")))((("modeling", "overview of")))--> next two chapters focus entirely on modeling, so rather than introducing modeling in too much detail in this chapter, we want to cover how to interact with models while doing data analysis.
First, an analysis project goes through many transformations and models before arriving at an answer. That's why the first data analysis diagram, introduced in Figure \@ref(fig:analysis-steps), illustrates a cycle of wrangling, visualizing, and modeling—we know you don't end with modeling, neither in R nor when using Spark.
Therefore, the ideal data analysis language enables you to quickly adjust over each wrangle-visualize-model iteration. Fortunately, this is the case when using Spark and R.
To illustrate how easy it is to iterate over wrangling and modeling in Spark, consider the following example. We start by performing a linear regression against all features to predict miles per gallon:
```{r analysis-model-linear, eval = FALSE}
cars %>%
ml_linear_regression(mpg ~ .) %>%
summary()
```
```
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
Coefficients:
(Intercept) cyl disp hp drat wt
12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
qsec vs am gear carb
0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
R-Squared: 0.869
Root Mean Squared Error: 2.147
```
At this point, it is very easy to experiment with different features; we can simply change the R formula from `mpg ~ .` to, say, `mpg ~ hp + cyl` to use only horsepower and cylinders as features:
```{r analysis-model-summary, eval = FALSE}
cars %>%
ml_linear_regression(mpg ~ hp + cyl) %>%
summary()
```
```
Deviance Residuals:
Min 1Q Median 3Q Max
-4.4948 -2.4901 -0.1828 1.9777 7.2934
Coefficients:
(Intercept) hp cyl
36.9083305 -0.0191217 -2.2646936
R-Squared: 0.7407
Root Mean Squared Error: 3.021
```
It is also very easy to iterate with other kinds of models. The following example replaces the linear model with a generalized linear model:
```{r analysis-model-glm, eval = FALSE}
cars %>%
ml_generalized_linear_regression(mpg ~ hp + cyl) %>%
summary()
```
```
Deviance Residuals:
Min 1Q Median 3Q Max
-4.4948 -2.4901 -0.1828 1.9777 7.2934
Coefficients:
(Intercept) hp cyl
36.9083305 -0.0191217 -2.2646936
(Dispersion parameter for gaussian family taken to be 10.06809)
Null deviance: 1126.05 on 31 degrees of freedom
Residual deviance: 291.975 on 29 degrees of freedom
AIC: 169.56
```
Usually, before fitting a model you need to apply multiple `dplyr` transformations to get the data ready to be consumed by the model. To make sure the model can be fitted as efficiently as possible, you should cache your dataset before fitting it, as described next.
### Caching
The examples<!--((("caching", "example of")))--> in this chapter are built using a very small dataset. In real-life scenarios, large amounts of data are used for models. If the data needs to be transformed first, the volume of the data could exact a heavy toll on the Spark session. Before fitting the models, it is a good idea to save the results of all the transformations in a new table loaded in Spark memory.
The `compute()` <!--((("commands", "compute()")))-->command can be appended to the end of a `dplyr` pipeline to save the results into Spark memory:
```{r analysis-caching-compute, eval = FALSE}
cached_cars <- cars %>%
mutate(cyl = paste0("cyl_", cyl)) %>%
compute("cached_cars")
```
```{r analysis-caching-linear, eval = FALSE}
cached_cars %>%
ml_linear_regression(mpg ~ .) %>%
summary()
```
```
Deviance Residuals:
Min 1Q Median 3Q Max
-3.47339 -1.37936 -0.06554 1.05105 4.39057
Coefficients:
(Intercept) cyl_cyl_8.0 cyl_cyl_4.0 disp hp drat
16.15953652 3.29774653 1.66030673 0.01391241 -0.04612835 0.02635025
wt qsec vs am gear carb
-3.80624757 0.64695710 1.74738689 2.61726546 0.76402917 0.50935118
R-Squared: 0.8816
Root Mean Squared Error: 2.041
```
As more insights are gained from the data, more questions might be raised. That is why we expect to iterate through the data wrangle, visualize, and model cycle multiple times. Each iteration should provide incremental insights into what the data is "telling us". There will be a point when we reach a satisfactory level of understanding. It is at this point that we will be ready to share the results of the analysis. This is the topic of the next section.
## Communicate
It<!--((("data analysis", "communicating results")))((("communicating results")))--> is important to clearly communicate the analysis results—as important as the analysis work itself! The public, colleagues, or stakeholders need to understand what you found out and how.
To<!--((("R Markdown")))--> communicate effectively, we need to use artifacts such as reports and presentations; these are common output formats that we can create in R, using _R Markdown_.
R Markdown documents allow you to weave narrative text and code together. The variety of output formats provides a very compelling reason to learn and use R Markdown: the same document can be rendered to HTML, PDF, PowerPoint, Word, web slides, websites, books, and so on.
Most<!--((("knitr package")))((("bookdown package")))--> of these outputs are available in the core R packages of R Markdown: `knitr` and `rmarkdown`. You can extend R Markdown with other R packages. For example, this book was written using R Markdown thanks to an extension provided by the `bookdown` package. The best resource to delve deeper into R Markdown is the official book.^[Xie Allaire G (2018). _R Markdown: The Definite Guide_, 1st edition. CRC Press.]
In R Markdown, a single artifact can be rendered in different formats; for example, you could render the same report as HTML or as a PDF file by changing a setting within the report itself. Conversely, multiple types of artifacts can be rendered to the same output format; for example, a presentation deck and a report could both be rendered as HTML.
Creating a new R Markdown report that uses Spark as the compute engine is easy. At the top, R Markdown expects a YAML header whose first and last lines are three consecutive dashes (`---`). The content between the dashes varies depending on the type of document; the only required field is the `output` value, since R Markdown needs to know what kind of output to render your report into. This YAML header is called the _frontmatter_. Following the frontmatter are sections of code, called _code chunks_, which can be interlaced with the narrative. There is nothing particularly special about using Spark with R Markdown; it is just business as usual.
Since an R Markdown document is self-contained and meant to be reproducible, before rendering documents, we should first disconnect from Spark to free resources:
```{r analysis-disconnect}
spark_disconnect(sc)
```
The following example shows how easy it is to create a fully reproducible report that uses Spark to process large-scale datasets. The narrative, the code, and, most important, the output of the code are recorded in the resulting HTML file. You can copy and paste the following code into a file; save the file with a _.Rmd_ extension, and choose whatever name you would like:
````markdown
---
title: "mtcars analysis"
output:
html_document:
fig_width: 6
fig_height: 3
---
`r ''````{r, setup, include = FALSE}
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3")
cars <- copy_to(sc, mtcars)
```
## Visualize
Aggregate data in Spark, visualize in R.
`r ''````{r fig.align='center', warning=FALSE}
library(ggplot2)
cars %>%
group_by(cyl) %>% summarise(mpg = mean(mpg)) %>%
ggplot(aes(cyl, mpg)) + geom_bar(stat="identity")
```
## Model
The selected model was a simple linear regression that
uses the weight as the predictor of MPG
`r ''````{r}
cars %>%
ml_linear_regression(mpg ~ wt) %>%
summary()
```
`r ''````{r, include = FALSE}
spark_disconnect(sc)
```
````
To((("commands", "render()"))) knit this report, save the file with a _.Rmd_ extension such as _report.Rmd_, and run `render()` from R. The output should look like that shown in Figure \@ref(fig:visualize-analysis-communicate).
```{r analysis-communicate-rmd, eval=FALSE}
rmarkdown::render("report.Rmd")
```
```{r visualize-analysis-communicate, eval = TRUE, fig.width=10, fig.height=5, fig.cap='R Markdown HTML output', fig.align = 'center', echo = FALSE}
render_image("images/analysis-rmarkdown.png")
```
You can now easily share this report, and its viewers won't need Spark or R to read and consume its contents; it's just a self-contained HTML file, trivial to open in any browser.
It is also common to distill insights of a report into many other output formats. Switching is quite easy: in the top frontmatter, change the `output` option to `powerpoint_presentation`, `pdf_document`, `word_document`, or the like. Or you can even produce multiple output formats from the same report:
```
---
title: "mtcars analysis"
output:
word_document: default
pdf_document: default
powerpoint_presentation: default
---
```
The result will be a PowerPoint presentation, a Word document, and a PDF. All of the same information that was displayed in the original HTML report is computed in Spark and rendered in R. You'll likely need to edit the PowerPoint template or the output of the code chunks.
This minimal example shows how easy it is to go from one format to another. Of course, it will take some more editing on the R user's side to make sure the slides contain only the pertinent information. The main point is that it does not require that you learn a different markup or code conventions to switch from one artifact to another.<!--((("", startref="SURdataanal03")))-->
## Recap
This chapter presented a solid introduction to data analysis with R and Spark. Many of the techniques presented looked quite similar to using just R and no Spark, which, while anticlimactic, is the right design to help users already familiar with R to easily transition to Spark. For users unfamiliar with R, this chapter also served as a very brief introduction to some of the most popular (and useful!) packages available in R.
It should now be quite obvious that, together, R and Spark are a powerful combination—a large-scale computing platform, along with an incredibly robust ecosystem of R packages, makes for an ideal analysis platform.
While doing analysis in Spark with R, remember to push computation to Spark and focus on collecting results in R. This paradigm sets you up for a successful approach to data manipulation, visualization, and communication of your results through a variety of output formats.
[Chapter 4](#modeling) will dive deeper into how to build statistical models in Spark using a much more interesting dataset (what’s more interesting than dating datasets?). You will also learn many more techniques that we did not even mention in the brief modeling section from this chapter.