# Reading and reshaping data {#read-and-reshape}
## Introduction
Experimental data can be recorded and stored in different ways. Anything that is not digital (e.g. notes in a physical lab book) has to be converted before it can be used in a computer. Nowadays, most information is already in a digital format and stored in a file. This can be a text file, an Excel file, or a file generated by a piece of equipment.
In R, the main structure for storing and processing data is a 'dataframe' (a modernized version of the dataframe is the 'tibble', which is available when the tidyverse package is used). The dataset `mtcars` that comes with R is a dataframe. We can check the type of a structure or variable with the function `class()`, which is a very helpful function to learn about the class of an object:
```{r}
class(mtcars)
```
The dataframe itself can be printed by just typing its name. To look only at the first rows we can use the function `head()`:
```{r}
head(mtcars)
```
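A compact overview of the structure of a dataframe, including the type of data in each column, can be obtained with the base R function `str()`:
```{r}
str(mtcars)
```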
When we read experimental data from a file, this implies that we transfer the information to a dataframe in R. A dataframe can hold several types of data. Let's first look at what different types of data you may encounter.
## Types of data
All the information that is recorded during an experiment can be considered as data. Details of the experimental approach or setup are called metadata, and the data that is measured is the raw data. Data visualization primarily deals with the raw, measured data. Yet, adding data on the experimental design or other types of metadata can be important for the interpretation of the data.
The raw data mostly consist of numbers, but these are often accompanied by labels for experimental conditions or to identify objects. All this information is data, but it is clearly of different types. Let's have a look at the most common types of data.
Measurements usually result in quantitative data consisting of numbers, for instance when an optical density is measured with a spectrophotometer or when a temperature is determined with a thermometer. This kind of data is called _quantitative and continuous_ data, since it can take any value, ranging from minus to plus infinity. Another type of quantitative data is _quantitative and discrete_ data, which consists only of natural numbers. An example is the number of colonies on an agar plate or the number of replicates.
There is also data that cannot be expressed in numbers, which we call _qualitative_ data, for instance different experimental categories such as 'control' and 'treated'. This is also known as nominal data (ordinal data is not covered, as it is not used in this book).
It is important to make the distinction between different types of data and to tell R how to treat the data. In some cases a number stored in a file actually represents a category, and it is important to treat it as a category and not as quantitative data. The example below shows that R stores information about the type of data.
The variable `x`, to which we assign the value 1, is of the following class:
```{r}
x <- 1
class(x)
```
We can convert this to a 'factor', which means that it is no longer a number but qualitative data:
```{r}
x <- as.factor(x)
class(x)
```
And therefore, this will give an error:
```{r, error=TRUE}
x + 1
```
When we convert it back to a number, it works:
```{r}
as.numeric(x) + 1
```
This simple example illustrates the difference between quantitative and qualitative data and it shows that we can change the data type in R.
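A word of caution when converting factors back to numbers: `as.numeric()` returns the underlying level index of a factor, not its label. In the example above these happen to be the same, but in general the conversion should go through `as.character()`. A minimal sketch to illustrate this:
```{r}
y <- as.factor(c(10, 20, 30))
# Returns the level indices 1, 2, 3 instead of the original values
as.numeric(y)
# Converting to character first recovers the original values 10, 20, 30
as.numeric(as.character(y))
```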
## Reading data
Reading, or loading, data is the transfer of information from a file to the memory of R, where it is stored as a dataframe. RStudio supports 'point-and-click' loading of data from its menu (File > Import Dataset > ...), which is a convenient way of loading data. Since the aim is to perform all steps in a script, including data loading, I'll explain how functions are used to load data from the console.
### Loading data from a text or csv file
Before we can read the file, we need to make sure that we can locate the file. In RStudio you can select the directory (folder) from the menu: Session > Set Working Directory > Choose Directory...
If you are running an R script and the data is in the same directory as the script, you can go to the menu of RStudio and select: Session > Set Working Directory > To Source File Location
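The working directory can also be set from a script or the console with base R functions. This is a minimal sketch; the path is hypothetical and should point to the directory that holds your data:
```{r, eval=FALSE}
# Show the current working directory
getwd()
# Set the working directory (hypothetical path, adjust to your own system)
setwd("~/DataViz-protocols")
```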
When the directory is properly set, you can read the file. A common file format is the 'comma separated values' or CSV format. Here we load a CSV file that was obtained from [fpbase.org](https://www.fpbase.org) and contains the excitation and emission data of Green Fluorescent Protein. This file and all other example data are available on GitHub: [https://github.com/JoachimGoedhart/DataViz-protocols](https://github.com/JoachimGoedhart/DataViz-protocols)
The function `read.csv()` is used to read the file and the data is assigned to a dataframe called 'df':
```{r}
df <- read.csv('FPbase_Spectra.csv')
```
To check whether the loading was successful, we can look at the first lines of the dataframe with the function `head()`:
```{r}
head(df)
```
This data has several columns, each containing quantitative data. Empty cells that do not contain any data are displayed as 'NA'. Note that this is different from '0'.
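To see why the distinction between 'NA' and '0' matters, here is a quick illustration: calculations on data that contains NA return NA unless the missing values are explicitly removed, whereas a 0 is simply included in the calculation:
```{r}
# NA propagates through calculations, whereas 0 is just another value
mean(c(1, 2, NA))
mean(c(1, 2, NA), na.rm = TRUE)
mean(c(1, 2, 0))
```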
The 'delimiter' of a file is a character that separates the different fields of data. In a CSV file, as the name implies, this is a comma. However, other characters such as semicolons or tabs are often used as delimiters in text files. The function `read.delim()` can be used to load a text file with a delimiter that can be specified, which makes this function more flexible for loading data. For the CSV file, it would be used like this:
```{r}
df <- read.delim('FPbase_Spectra.csv', sep = ",")
```
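For a tab-delimited text file the same function can be used with a different separator. A minimal sketch, with a hypothetical filename:
```{r, eval=FALSE}
# Reading a tab-delimited text file (the filename 'my_data.txt' is hypothetical)
df <- read.delim('my_data.txt', sep = "\t")
```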
### Loading data from a URL
When the CSV file is available online, e.g. in a data repository or on GitHub, it can be loaded by providing the URL:
```{r read-URL-zenodo}
df <- read.csv('https://zenodo.org/record/2545922/files/FRET-efficiency_mTq2.csv')
head(df)
```
### Retrieving data from Excel
Suppose we have an Excel file with multiple tabs and we would like to access the data for mNeonGreen. To import the correct data into a dataframe use:
```{r}
df <- readxl::read_excel('FPbase_Spectra.xlsx', sheet = 'mNeonGreen')
head(df)
```
Again, 'NA' indicates that no data is available. Since `read_excel()` is a function from the tidyverse (the readxl package), the data is stored in a tibble. This can be converted to an ordinary dataframe:
```{r}
df <- as.data.frame(df)
class(df)
```
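As a side note, if the names of the tabs in an Excel file are not known beforehand, the readxl package can list them. This is a small sketch, not a required step in the workflow:
```{r}
# List the names of all sheets (tabs) in the Excel file
readxl::excel_sheets('FPbase_Spectra.xlsx')
```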
Now that we have the data loaded, we can generate a plot. For simplicity I use the `qplot()` function; the more flexible `ggplot()` function will be introduced later. We need to supply the name of the dataframe, the column for the x-axis data and the column that is used for the y-axis. Note that two of the column names contain a space, and to properly indicate such a column name we need to enclose it in backticks (`).
```{r}
qplot(data=df,x=Wavelength, y=`mNeonGreen EX`)
```
A plot with lines instead of dots can be made by supplying this alternative 'geometry':
```{r}
qplot(data=df,x=Wavelength, y=`mNeonGreen EX`, geom='line')
```
The plots show the excitation spectrum of mNeonGreen, and it can be inferred that the protein is maximally excited near 500 nm.
In its current shape, the dataframe is not suitable for simultaneously plotting the excitation and emission spectrum. To do that, we need to reshape the data and this will be the topic of the [Reshaping data](#tidy-data) section.
### Retrieving data from multiple files
When the data is spread over multiple files, it is useful to read these files and combine them into one dataframe. In this example we have the data from three different conditions, each stored in an individual CSV file. First we create a list with the files based on a pattern; in this case the relevant files contain the string `S1P.csv`:
```{r}
filelist = list.files(pattern="*S1P.csv")
filelist
```
Then we use the function `map()` to apply the function `read.csv()` to each of the files and we store the result in a list named 'df_input_list'.
```{r}
df_input_list <- map(filelist, read.csv)
```
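To check what `map()` returned, we can inspect the class and length of the result:
```{r}
class(df_input_list)
length(df_input_list)
```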
The result is a list of dataframes, one for each file. Instead of having three separate dataframes, we want a single dataframe with a label that reflects the condition. The labels are based on the filenames. We can use `str_replace()` to remove the extension from the filenames:
```{r}
names(df_input_list) <- str_replace(filelist, pattern = ".csv", replacement = "")
```
After this, we merge the dataframes and create a column 'id' that has the label with the filename:
```{r}
df <- bind_rows(df_input_list, .id = "id")
head(df)
```
This dataframe contains all the relevant information, but it is not tidy yet. We will discuss below how to convert this dataframe into a tidy format. For now, we save it for later use:
```{r}
df %>% write.csv('df_S1P_combined.csv', row.names=FALSE)
```
## Reshaping data {#tidy-data}
Data is often recorded in tables or spreadsheets. Columns are typically used for different conditions (indicated in a header) and each data cell contains a measured value. Although this format makes perfect sense for humans, it is less suitable for analysis and visualization in R. Instead of the tabular, or wide, format, the functions from the tidyverse package work with data in a 'tidy' format. The benefit of tidy data is that it is a consistent way to structure datasets, facilitating data manipulation and visualization [Wickham, 2014](https://doi.org/10.18637/jss.v059.i10). In other words, this format simplifies downstream processing and visualization.
In this section, I will show how data can be converted from spreadsheet format to a long, tidy format. This step is needed to prepare the data for visualization with `ggplot()` which is also part of the tidyverse package. I will use the nomenclature that is used in the original publication by Hadley Wickham.
Before we start, a quick warning that I have been struggling with the concept of tidy data, probably because I was very much used to collecting, processing and summarizing data in spreadsheets. In addition, I am used to reading and presenting data in a tabular format. It is important to realize that data in the tidy format contains exactly the same information as non-tidy, spreadsheet data, but it is structured in a different way. In fact, we can switch back and forth between the two formats with functions that are provided by R.
### Quantitative data, discrete conditions
Let's say that you have measured cell sizes under a number of different experimental conditions and stored the results in an Excel spreadsheet. Let's load the data:
```{r}
df <- readxl::read_excel('Length-wide.xls')
df
```
First, we will replace the spaces in the column names. Although this is not strictly necessary, it simplifies handling of the data. We use `names()` to get the names of the dataframe and `str_replace()` to replace the space by an underscore:
```{r}
names(df) <- str_replace(names(df), pattern = " ", replacement = "_")
head(df)
```
Now that the column names are fixed, we can restructure the data. The aim is to create a dataframe with one column that specifies the conditions and one column with all the measured values. There is a long history of packages and functions that can be used for restructuring (`gather()`, `melt()`). Here we use the most modern function `pivot_longer()` from the tidyverse package. We need to specify the dataframe, which columns to restructure (specified by `everything()` here) and the names of the new columns. The current column names will be transferred to a column that is named "condition" and all values will be transferred to a column named "size".
```{r}
df_tidy <- pivot_longer(df, cols = everything(), names_to = "Condition", values_to = "Size")
head(df_tidy)
```
The benefit of this format is that it is now clear what the numbers represent.
The most important requirement for tidy data is that each variable occupies only a single column and that each row is an observation.
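As mentioned above, we can also go back from the tidy, long format to the wide, spreadsheet-like format. Below is a minimal sketch with `pivot_wider()`; because each condition holds multiple measurements, a row number is added first so that every value is uniquely identified:
```{r}
# Reverse the operation: from long (tidy) back to wide format
df_wide <- df_tidy %>%
  group_by(Condition) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Condition, values_from = Size) %>%
  select(-row)
head(df_wide)
```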
Let's save the tidy data in a csv file:
```{r}
df_tidy %>% write.csv('Length-tidy.csv')
```
### Multiple discrete conditions
Here, we deal with a more complex spreadsheet that holds data of multiple replicates and two experimental conditions. Note that the data from multiple conditions can be stored in different ways and here we only treat one of them. Especially for these kinds of datasets, the tidy format is a better, cleaner structure. We will load the data from a repository:
```{r}
df_multiheader <- read.csv('https://zenodo.org/record/4056966/files/Data-with-replicates.csv')
head(df_multiheader)
```
The first row lists the experimental condition and the second row identifies the biological replicates. When this is loaded as an ordinary CSV, the first row is the header, but the second row is treated as data. Let's load the data without flagging a header. We also add `stringsAsFactors = FALSE` to make sure that the data is loaded as characters `<chr>` and not as factors `<fctr>`. The difference is not obvious here, but without it we would run into problems later when we want to convert all the values into actual numbers (which we will do at the very last step):
```{r}
df_multiheader <- read.csv("https://zenodo.org/record/4056966/files/Data-with-replicates.csv", header = FALSE, stringsAsFactors = FALSE)
head(df_multiheader)
```
We extract the first row as a vector that contains the name of each column. To this end we select the first row of the dataframe with the brackets `[1,]`. The result is a dataframe, and to turn this into a vector of strings we use `unlist()`:
```{r}
first_row <- df_multiheader[1,] %>%
unlist(use.names=FALSE)
first_row
```
We repeat this for the second row:
```{r}
second_row <- df_multiheader[2,] %>%
unlist(use.names=FALSE)
second_row
```
Next, row 1 and row 2 are removed from the dataframe, keeping only the data:
```{r}
df <- df_multiheader[-c(1:2),]
head(df)
```
The labels of the conditions and replicates are combined by pasting them together with an underscore to separate the labels. The result is a single vector with unique labels:
```{r}
combined_labels <- paste(first_row, second_row, sep="_")
combined_labels
```
Now, we can add these labels as column names to the dataframe:
```{r}
colnames(df) <- combined_labels
head(df)
```
To convert this dataframe into a tidy format we use the `pivot_longer()` function, exactly like we did in the previous example:
```{r}
df_tidy <- pivot_longer(df, cols = everything(), names_to = "combined_labels", values_to = "Size")
head(df_tidy)
```
The dataframe is tidy now, but we need to split the conditions from the replicates in the first column with combined labels:
```{r}
df_tidy <- df_tidy %>% separate(combined_labels, c('Treatment', 'Replicate'))
head(df_tidy)
```
There is still one problem that we need to fix. The values in the column 'Size' are characters `<chr>`, which means these are strings. Let's convert the strings to actual numbers:
```{r}
df_tidy <- df_tidy %>% mutate(Size = as.numeric(Size))
head(df_tidy)
```
This kind of data, acquired under different conditions with multiple replicates, is ideally suited for a SuperPlot [(Lord et al., 2020)](https://doi.org/10.1083/jcb.202001064). An example of this kind of data visualization is given in [Protocol 2](#protocol-2).
### Double quantitative data
An example of quantitative continuous data for two variables is when measurements are performed at different concentrations, times or wavelengths. We encountered the latter before, when the spectral data was loaded. Let's look again at that data and convert it to the tidy format.
```{r}
df <- readxl::read_excel('FPbase_Spectra.xlsx', sheet = 'mNeonGreen')
head(df)
```
Both the emission (EM) and excitation (EX) data are acquired as a function of wavelength. To convert this data to the tidy format, we need to keep one column with the wavelength data and gather the spectral data in another column. To achieve this, we reshape the mNeonGreen data while keeping the Wavelength column in place (it is excluded from the pivoting with the `-` sign):
```{r}
df_tidy <- pivot_longer(df, cols = -Wavelength, names_to = "sample", values_to = "intensity")
head(df_tidy)
```
We can now plot the two spectra, which was not possible before the conversion:
```{r}
qplot(data=df_tidy,x=Wavelength, y=intensity, color=sample, geom='line')
```
A more elaborate example of plotting spectra is given in [Protocol 1](#protocol-1).
### Data from multiple files
In an earlier example, we have combined the data from multiple files into a single dataframe. Let's first load it:
```{r}
df <- read.csv('df_S1P_combined.csv')
head(df)
```
The data is still in a wide format and all the data in columns whose names start with 'Cell' needs to be combined into a single column. The other columns, `id` and `Time`, need to be excluded from this operation and this is achieved by using the `-` sign:
```{r}
df_tidy <- pivot_longer(df, cols = -c(id, Time), names_to = "object", values_to = "activity")
head(df_tidy)
```
If desired, the column 'id' can be renamed and/or split:
```{r}
df_tidy <- df_tidy %>% separate(id, c('Condition', 'Treatment'))
head(df_tidy)
```
This dataframe is perfectly tidy, but there is one small change that improves the sorting of the objects. The cell numbers run from 1 to 32, and when the object column is sorted, Cell.1 is followed by Cell.10:
```{r}
sort(unique(df_tidy$object))
```
To correct this, we need a 0 preceding the single digit numbers, e.g. Cell.01. To do that, we first split the object column into two columns, using the dot as a separator:
```{r}
df_tidy <- df_tidy %>% separate("object", c("object", "number"), sep="\\.")
```
Then, we pad all numbers to two digits by adding a 0 in front of the single-digit numbers with the function `str_pad()`:
```{r}
df_tidy <- df_tidy %>% mutate(number=str_pad(number, 2, pad = "0"))
```
We can merge the two columns back together; here I changed the separator to a space:
```{r}
df_tidy <- df_tidy %>% unite("object", c("object", "number"), sep=" ")
head(df_tidy)
```
When the sorting is repeated it looks better:
```{r}
sort(unique(df_tidy$object))
```
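As an aside, if the padded labels are not needed elsewhere, the stringr package (loaded as part of the tidyverse) offers natural sorting as an alternative. A minimal sketch:
```{r}
# Natural sorting keeps Cell.2 before Cell.10 without padding the numbers
str_sort(c("Cell.1", "Cell.10", "Cell.2"), numeric = TRUE)
```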
We will save this tidy dataframe for later use:
```{r}
df_tidy %>% write.csv("df_S1P_combined_tidy.csv", row.names = FALSE)
```
### Data in 96-wells format {#data-in-96-wells-format}
Data measured on samples in multiwell plates are often stored in a structure that resembles the plate layout. As an example, we use here a dataset from a luciferase experiment, measured with a 96-well plate luminescence reader. The data is stored in a sheet named 'Results' in an xlsx file. The cells in which the data is stored are located in F21:Q28. By selecting these cells, only the 96 values that were measured are read:
```{r}
df <- readxl::read_excel('DualLuc_example_data.xlsx', sheet = 'Results', range = "F21:Q28", col_names = F)
```
The table-like layout of the data can be changed into a list of 96 values:
```{r}
data_as_list <- df %>% unlist(use.names = F)
```
It is essential to know how the table is converted: the data is read from the first column, top to bottom, then the second column, and so on.
Knowing this, we can define the wells to which the data belongs, which would be A1, B1, ... , G12, H12.
```{r}
column <- rep(1:12, each=8)
row <- rep(LETTERS[1:8],12)
Well <- paste0(row,column)
Well
```
We can now generate a dataframe that lists the wells and the values, which is a (luminescence) intensity. We can also add two additional columns that list the row and column information:
```{r}
df_tidy_wells <- data.frame(column, row, Well, Intensity=data_as_list)
head(df_tidy_wells)
```
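To get a quick impression of the data in its plate layout, the row and column information can be used for a simple plot. This is a minimal sketch with `qplot()`, not a required step:
```{r}
# Quick look at the plate; the size of the dots reflects the intensity
qplot(data = df_tidy_wells, x = column, y = row, size = Intensity)
```
Finally, we save this dataframe: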
```{r}
df_tidy_wells %>% write.csv("df_tidy_wells.csv")
```
This concludes the conversion of data from the plate layout into a tidy format. The instructions explained here are used in [Protocol 4](#protocol-4).