# ggplot2 package
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=5, fig.height=4,
echo=TRUE, warning=FALSE, message=FALSE)
```
* Graphing package inspired by the **G**rammar of **G**raphics work of Leland Wilkinson.
* A tool that lets you concisely describe the components of a graphic.
* Why ggplot2?
    + Flexible
    + Customizable
    + Pretty!
    + Well documented
* We will see:
    * Scatter plots
    * Box plots
    * Bar plots
    * Histograms
    * How to save plots
    * Volcano plots
## Getting started
Load package:
```{r, eval=TRUE, echo=TRUE, warning=F, message=F}
library(ggplot2)
```
* All ggplots start with a **base layer** created with the **ggplot()** function:
```{r, eval=F, echo=TRUE}
ggplot(data=dataframe, mapping=aes(x=column1, y=column2))
```
*The base layer sets up the data and the axes but does NOT plot anything by itself.*
* You then add a layer (with the **+** sign) that describes what kind of plot you want:
    * geom_point()
    * geom_bar()
    * geom_histogram()
    * geom_boxplot()
    * ...
* And then you will add **one layer at a time** to add more features to your plot!
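As a minimal sketch of what "adding a layer" means, the chunk below builds a base layer on a small made-up data frame (`toy`, not part of the course data) and counts the layers stored in the plot object before and after adding `geom_point()`. Reading the `layers` element of a plot object is an assumption about ggplot2 internals and may differ between versions.

```{r, eval=TRUE}
library(ggplot2)
# Hypothetical toy data, for illustration only
toy <- data.frame(column1=1:5, column2=c(2, 4, 3, 5, 1))
# Base layer only: data and aesthetics are declared, nothing is drawn yet
base_layer <- ggplot(data=toy, mapping=aes(x=column1, y=column2))
length(base_layer$layers)
# Adding a geom gives the plot something to draw
with_points <- base_layer + geom_point()
length(with_points$layers)
```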
## Scatter plot
```{r, eval=F}
# Example of a scatter plot: add the geom_point() layer
ggplot(data=dataframe, mapping=aes(x=column1, y=column2)) + geom_point()
```
* Example of a simple scatter plot:
```{r, eval=TRUE}
# Create a data frame
df1 <- data.frame(sample1=rnorm(200), sample2=rnorm(200))
# Plot!
ggplot(data=df1, mapping=aes(x=sample1, y=sample2)) +
  geom_point()
```
* Add **layers** to that object to customize the plot:
    * **ggtitle** to add a title
    * **geom_vline** to add a vertical line
    * etc.
```{r, eval=TRUE}
ggplot(data= df1 , mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle(label="my first ggplot") +
geom_vline(xintercept=0)
```
Bookmark this [ggplot2 reference](https://ggplot2.tidyverse.org/reference/) and the [cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) for some of the ggplot2 options.
* You can save the plot **in an object** at any time and add layers to that object:
```{r, eval=TRUE}
# Save in an object
p <- ggplot(data= df1 , mapping=aes(x=sample1, y=sample2)) +
geom_point()
# Show p: write p in the console and ENTER
p
# Add layers to that object
p + ggtitle(label="my first ggplot")
```
* What goes inside the **aes** (aesthetics) function?
    * Anything that varies according to your data!
    * Columns with values to be plotted.
    * Columns with which you want to, for example, color the points.
Color all points in red (not depending on the data):
```{r, eval=TRUE}
ggplot(data=df1 , mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red")
```
Color the points according to another column in the data frame:
```{r, eval=TRUE}
# Build a new data frame df2 from df1:
# add a column "grouping" containing "yes" and "no" values.
df2 <- data.frame(df1,
grouping=rep(c("yes", "no"), c(80, 120)))
# Plot and add the color parameter in the aes():
pscat <- ggplot(data=df2, mapping=aes(x=sample1, y=sample2, color=grouping)) +
geom_point()
pscat
```
Note that the legend is automatically added to the plot!
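A common pitfall follows from the two examples above: putting a constant such as `"red"` *inside* `aes()`. The sketch below, on a made-up data frame (`toy` is hypothetical), compares the two cases with `layer_data()`, which returns the values ggplot2 computed for a layer. It is only meant to illustrate the difference between *setting* and *mapping* a color.

```{r, eval=TRUE}
library(ggplot2)
# Hypothetical toy data, for illustration only
toy <- data.frame(x=1:4, y=c(2, 1, 4, 3))
# Setting: a constant outside aes() colors every point red, with no legend
p_set <- ggplot(data=toy, mapping=aes(x=x, y=y)) +
  geom_point(color="red")
# Mapping: inside aes(), "red" is treated as data (a group with one level),
# so the points get a color from the default palette and a legend appears
p_map <- ggplot(data=toy, mapping=aes(x=x, y=y, color="red")) +
  geom_point()
# Colors that ggplot2 actually assigned to the points
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```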
**HANDS-ON**
We will now use the **rock** dataset from the `datasets` package. It contains measurements on 48 rock samples from a petroleum reservoir.
* Create a scatter plot of **area** versus **peri** (perimeter).
* Color the points according to column **perm** of **rock**
* Create a horizontal line representing the **median perimeter**.
<details>
<summary>
*Answer*
</summary>
```{r}
# Create a scatter plot of **area** versus **peri** (perimeter).
ggplot(data=rock, mapping=aes(x=area, y=peri)) + geom_point()
# Color the points according to column **perm** of **rock**
ggplot(data=rock, mapping=aes(x=area, y=peri, color=perm)) +
geom_point()
# Create a horizontal line representing the **median perimeter**.
ggplot(data=rock, mapping=aes(x=area, y=peri, color=perm)) +
geom_point() +
geom_hline(yintercept=median(rock$peri))
```
</details>
## Box plots
* Simple boxplot showing the distribution of the **sample1** values:
```{r, eval=TRUE}
ggplot(data=df2, mapping=aes(x="", y=sample1)) + geom_boxplot()
```
* Split the data into 2 boxes, depending on the **grouping** column:
```{r, eval=TRUE}
ggplot(data=df2, mapping=aes(x=grouping, y=sample1)) + geom_boxplot()
```
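* What if you want to plot both sample1 and sample2?<br>
*You need to convert the data from a **wide** to a **long** format.*
<br>
What is the **long** format?<br>
One row **per observation/value**: all measured values sit in a single column, and a second column records which sample each value came from.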
<img src="images/plots/wide2long.png" width="800">
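To make the wide/long distinction concrete, here is a minimal sketch on a made-up 3-row table (`wide` and `long` are hypothetical names). It uses the base R function `stack()` rather than the `melt()` function used in the rest of this section, so the resulting columns are called `values` and `ind` instead of `value` and `variable`.

```{r, eval=TRUE}
# Hypothetical wide table: one column per sample, one row per measurement
wide <- data.frame(sample1=c(1.2, 0.4, -0.3), sample2=c(0.8, -1.1, 2.0))
# Long table: one row per value, plus a column naming the sample
long <- stack(wide)
long
# Compare the dimensions of the two tables
dim(wide)
dim(long)
```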
Plotting both sample1 and sample2:
```{r, eval=FALSE}
# install package reshape2
install.packages("reshape2")
```
```{r, eval=TRUE, message=FALSE, warning=FALSE, error=FALSE}
# load package
library("reshape2")
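# convert to long format, keeping "grouping" as the identifier column
df_long <- melt(data=df2, id.vars="grouping")
# all numeric values are now gathered in one column, "value";
# the "variable" column records which sample each value came from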
# plot:
ggplot(data=df_long, mapping=aes(x=variable, y=value)) +
geom_boxplot()
```
* What if you now also want to see the distribution of "yes" and "no" in both sample1 and sample2?<br>
*Add a parameter to **aes()***: either *color* or *fill*.
```{r, eval=TRUE}
# Either color (color of the box border)
ggplot(data=df_long, mapping=aes(x=variable, y=value, color=grouping)) +
geom_boxplot()
```
```{r, eval=TRUE}
# Or fill (color inside the box)
ggplot(data=df_long, mapping=aes(x=variable, y=value, fill=grouping)) +
geom_boxplot()
```
Do you want to change the default colors?<br>
* Add one of these layers:
    * **scale_color_manual()** for the color of the box borders
    * **scale_fill_manual()** for the color inside the boxes
```{r, eval=TRUE}
pbox_fill <- ggplot(data=df_long, mapping=aes(x=variable, y=value, fill=grouping)) +
geom_boxplot() +
scale_fill_manual(values=c("slateblue2", "chocolate"))
pbox_fill
pbox_col <- ggplot(data=df_long, mapping=aes(x=variable, y=value, color=grouping)) +
geom_boxplot() +
scale_color_manual(values=c("slateblue2", "chocolate"))
pbox_col
```
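In the chunk above, the colors are matched to the levels of **grouping** in order ("no" first, then "yes"). A minimal sketch of a more explicit alternative is to pass a *named* vector to `scale_fill_manual()`, so each color is tied to a level by name. The data frame `toy` and the vector `my_colors` are made up for this illustration.

```{r, eval=TRUE}
library(ggplot2)
# Hypothetical toy data, for illustration only
toy <- data.frame(grouping=rep(c("yes", "no"), each=10),
                  value=c(rnorm(10, mean=1), rnorm(10)))
# The names tie each color to a level, whatever the order of the levels
my_colors <- c(no="slateblue2", yes="chocolate")
p_named <- ggplot(data=toy, mapping=aes(x=grouping, y=value, fill=grouping)) +
  geom_boxplot() +
  scale_fill_manual(values=my_colors)
p_named
```

The same idea should apply to `scale_color_manual()`.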
**HANDS-ON**
Let's use the **CO2** dataset that represents the carbon dioxide uptake in grass plants:
* Create a boxplot that represents the **uptake** for each **Treatment**.
* Split each boxplot per **Type**. *Use either the color or the fill argument*.
* Move the legend to the bottom of the plot. You can get help from [this page](http://www.sthda.com/english/wiki/ggplot2-legend-easy-steps-to-change-the-position-and-the-appearance-of-a-graph-legend-in-r-software)
<details>
<summary>
*Answer*
</summary>
```{r}
# Create a boxplot that represents the **uptake** for each **Treatment**.
ggplot(data=CO2, mapping=aes(x=Treatment, y=uptake)) + geom_boxplot()
# Split each boxplot per **Type**.
ggplot(data=CO2, mapping=aes(x=Treatment, y=uptake, fill=Type)) +
geom_boxplot()
# Move the legend to the bottom of the plot.
ggplot(data=CO2, mapping=aes(x=Treatment, y=uptake, fill=Type)) +
geom_boxplot() +
theme(legend.position = "bottom")
```
</details>
## Bar plots
```{r, eval=TRUE}
# A simple bar plot
ggplot(data=df2, mapping=aes(x=grouping)) + geom_bar()
```
* Customize:
    * **scale_x_discrete** is used to handle x-axis title and labels
    * **coord_flip** swaps the x and y axis
```{r, eval=TRUE}
# Save the plot in the object "pbar"
pbar <- ggplot(data=df2, mapping=aes(x=grouping, fill=grouping)) +
geom_bar()
pbar
# Change x axis label with scale_x_discrete and change order of the bars:
p2 <- pbar + scale_x_discrete(name="counts of yes / no", limits=c("yes", "no"))
p2
# Swapping x and y axis with coord_flip():
p3 <- p2 + coord_flip()
p3
# Change fill
p4 <- p3 + scale_fill_manual(values=c("yellow", "cyan"))
p4
```
**HANDS-ON**
Let's use the **chickwts** dataset again:
* Create a barplot of the different **feed supplements**.
* Change the orientation of the x-axis labels (Look it up in [this post](https://rstudio-pubs-static.s3.amazonaws.com/32217_6e2c396729704050972fb84f0d58ee22.html) ).
<details>
<summary>
*Answer*
</summary>
```{r}
# Create a barplot of the different **feed supplements**.
ggplot(data=chickwts, mapping=aes(x=feed)) +
geom_bar()
# Change the orientation of the x-axis labels
ggplot(data=chickwts, mapping=aes(x=feed)) +
geom_bar() +
theme(axis.text.x=element_text(angle=45))
```
</details>
### Bar plots with error bars
We can create error bars on barplots.
<br>
Let's create a toy data set, that contains 7 independent qPCR measurements for 3 genes:
```{r, eval=TRUE}
pcr <- data.frame(Dkk1=c(18.2, 18.1, 17.8, 17.85, 18.6, 12.4, 10.7),
Pten=c(15.1,15.2, 15.0, 15.6, 15.3, 14.8, 15.9),
Tp53=c(9.1, 9.9, 9.25, 8.7, 8.8, 9.3, 7.8))
```
The height of each bar will represent the **average** qPCR measurement. The error bar will extend from **average - standard deviation** at the lower end to **average + standard deviation** at the upper end.
<br>
We need to create a data frame that **summarizes** these values:
```{r, eval=TRUE}
pcr_summary <- data.frame(average=apply(pcr, 2, mean),
standard_deviation=apply(pcr, 2, sd),
genes=colnames(pcr))
```
And now we can plot!
```{r, eval=TRUE}
ggplot(pcr_summary, aes(x=genes, y=average, fill=genes)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin=average-standard_deviation, ymax=average+standard_deviation), colour="black", width=.1)
```
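When the bar heights are already computed, as here, `geom_col()` can be used as a shorthand for `geom_bar(stat = "identity")`. The sketch below shows this on a small made-up summary table (`toy_summary` is hypothetical and stands in for `pcr_summary`).

```{r, eval=TRUE}
library(ggplot2)
# Hypothetical summary table, for illustration only
toy_summary <- data.frame(genes=c("geneA", "geneB"),
                          average=c(10, 15),
                          standard_deviation=c(1.5, 0.5))
# geom_col() draws bars whose height is taken directly from the y column
p_col <- ggplot(toy_summary, aes(x=genes, y=average, fill=genes)) +
  geom_col() +
  geom_errorbar(aes(ymin=average-standard_deviation,
                    ymax=average+standard_deviation),
                colour="black", width=.1)
p_col
```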
## Histograms
Simple histogram on one sample (using the df1 data frame):
```{r, eval=TRUE}
ggplot(data=df1, mapping=aes(x=sample1)) + geom_histogram()
```
Histogram on more samples (using df_long):
```{r, eval=TRUE}
ggplot(data=df_long, mapping=aes(x=value)) + geom_histogram()
```
Split the data per sample ("variable" column that represents here the samples):
```{r, eval=TRUE}
ggplot(data=df_long, mapping=aes(x=value, fill=variable)) + geom_histogram()
```
By default, the histograms are **stacked**: change to position **dodge** (side by side):
```{r, eval=TRUE}
phist <- ggplot(data=df_long, mapping=aes(x=value, fill=variable)) +
geom_histogram(position='dodge')
phist
```
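The histograms above use the default binning (30 bins). As a minimal sketch, the bin size can be controlled with either the `binwidth` or the `bins` argument of `geom_histogram()`. The data frame `toy` is made up for this illustration.

```{r, eval=TRUE}
library(ggplot2)
# Hypothetical toy data, for illustration only
toy <- data.frame(value=rnorm(200))
# binwidth: width of each bin, in the units of the x axis
p_binwidth <- ggplot(data=toy, mapping=aes(x=value)) +
  geom_histogram(binwidth=0.5)
# bins: number of bins to split the data into
p_bins <- ggplot(data=toy, mapping=aes(x=value)) +
  geom_histogram(bins=15)
p_binwidth
p_bins
```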
**HANDS-ON**
Going back to the **rock** dataset:
* Create a histogram of the rocks **perimeter**.
* Add a density plot to the histogram, following instructions from [this post](http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization#add-mean-line-and-density-plot-on-the-histogram)
<details>
<summary>
*Answer*
</summary>
```{r}
# Create a histogram of the rocks **perimeter**.
ggplot(data=rock, mapping=aes(x=peri)) + geom_histogram()
# Add a density plot to the histogram
ggplot(data=rock, mapping=aes(x=peri)) +
geom_histogram(aes(y=after_stat(density))) + # after_stat() replaces the older ..density.. notation
geom_density(alpha=.2, fill="lightblue")
```
</details>
## About themes
You can change the overall **theme** of a plot, which controls all non-data elements such as the background color and the grid lines:
```{r, eval=TRUE, message=F, warning=F}
# go back to a previous plot
p <- ggplot(data=df_long, mapping=aes(x=value)) + geom_histogram()
# Try different themes
p + theme_bw()
p + theme_minimal()
p + theme_void()
p + theme_grey()
p + theme_dark()
p + theme_light()
```
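The chunk above changes the theme of one plot at a time. To change the default for every plot created afterwards, one option is `theme_set()`, sketched below on made-up data. It returns the previously active theme, which makes it easy to restore. The name `old_theme` is only for illustration.

```{r, eval=TRUE}
library(ggplot2)
# theme_set() returns the theme that was active before the change
old_theme <- theme_set(theme_bw())
# Plots created from now on use theme_bw() without adding it explicitly
ggplot(data.frame(x=1:3, y=c(2, 3, 1)), aes(x=x, y=y)) + geom_point()
# Restore the previous default theme
theme_set(old_theme)
```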
## Saving plots in files
* The same approach as for base R plots applies: open a graphics device, print the plot, then close the device:
```{r, eval=F, message=F, warning=F}
png("myggplot.png")
print(p)  # an explicit print() also works when the code is run with source()
dev.off()
```
* You can also use the ggplot2 **ggsave** function:
```{r, eval=F, message=F, warning=F}
# By default, save the last plot that was produced
ggsave(filename="lastplot.png")
# You can pick which plot you want to save:
ggsave(filename="myplot.png", plot=p)
# Many different formats are available:
# "eps", "ps", "tex", "pdf", "jpeg", "tiff", "png", "bmp", "svg", "wmf"
ggsave(filename="myplot.ps", plot=p, device="ps")
# Change the height and width (and their unit):
ggsave(filename="myplot.pdf",
width = 20,
height = 20,
units = "cm")
```
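As a minimal, self-contained sketch of `ggsave()`, the chunk below saves a small made-up plot to a temporary file and checks that the file was written. The `dpi` argument sets the resolution of raster formats such as png. The names `p_toy` and `out_file` are hypothetical.

```{r, eval=TRUE}
library(ggplot2)
# Hypothetical toy plot, for illustration only
p_toy <- ggplot(data.frame(x=1:3, y=c(2, 3, 1)), aes(x=x, y=y)) + geom_point()
# Write to a temporary file so nothing is left in the working directory
out_file <- tempfile(fileext=".png")
# The format is guessed from the file extension; width and height are in cm here
ggsave(filename=out_file, plot=p_toy, width=10, height=8, units="cm", dpi=150)
file.exists(out_file)
```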
* You can also organize several plots on one page
    * One way is to use the **gridExtra** package:
        * **ncol**, **nrow**: number of columns and rows in which to arrange the plots
```{r, eval=F}
install.packages("gridExtra")
```
```{r, eval=TRUE, message=F, error=F, warning=F}
# load package
library(gridExtra)
# 2 rows and 2 columns
grid.arrange(pscat, pbox_fill, pbar, phist, nrow=2, ncol=2)
```
```{r, fig.width=10, fig.height=4, eval=TRUE, message=F, warning=F}
# 1 row and 4 columns
grid.arrange(pscat, pbox_fill, pbar, phist, nrow=1, ncol=4)
```
Combine ggsave and grid.arrange:
```{r, eval=FALSE}
myplots <- grid.arrange(pscat, pbox_fill, pbar, phist, nrow=1, ncol=4)
ggsave(filename="mygridarrange.png", plot=myplots)
```
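Note that `grid.arrange()` also draws the arranged plots on the current device. If you only want to build the layout and save it, a possible alternative is `arrangeGrob()` from the same package, sketched here with two made-up plots (`p_left`, `p_right` and `out_file` are hypothetical names).

```{r, eval=TRUE}
library(ggplot2)
library(gridExtra)
# Hypothetical toy plots, for illustration only
toy <- data.frame(x=1:3, y=c(2, 3, 1))
p_left <- ggplot(toy, aes(x=x, y=y)) + geom_point()
p_right <- ggplot(toy, aes(x=x, y=y)) + geom_line()
# arrangeGrob() builds the layout without drawing it
two_plots <- arrangeGrob(p_left, p_right, nrow=1, ncol=2)
# Save the arranged plots to a temporary file
out_file <- tempfile(fileext=".png")
ggsave(filename=out_file, plot=two_plots, width=20, height=10, units="cm")
```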
**HANDS-ON** (same as for the "base" plots!)
Go back to the previous plots you created (if you didn't save commands in an R script, you can refer to the *History* tab in the top-right panel):
* Save 1 plot of your choice in a **jpeg** file.
* Save 3 plots of your choice in a **pdf** file (one plot per page).
* Organize the same 3 plots in 1 row / 3 columns. Save the image in a **png** file. Play with the **width** (and perhaps also **height**) argument of `ggsave()` (or of `png()`) until you are satisfied with the way the plot renders.
<details>
<summary>
*Answer*
</summary>
```{r}
# Save 1 plot of your choice in a **jpeg** file.
phisto <- ggplot(data=df_long, mapping=aes(x=value)) + geom_histogram()
ggsave(filename="myhistogram.jpeg", plot=phisto)
# Save 3 plots of your choice in a **pdf** file (one plot per page).
pbarplot <- ggplot(data=chickwts, mapping=aes(x=feed)) + geom_bar()
pboxplot <- ggplot(data=CO2, mapping=aes(x=Treatment, y=uptake, fill=Type)) + geom_boxplot()
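pdf("my3ggplot2.pdf")
# print() makes sure each plot is drawn, also when the code is run with source()
print(phisto)
print(pbarplot)
print(pboxplot)
dev.off()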
# Organize the same 3 plots in 1 row / 3 columns.
my3ggplots <- grid.arrange(phisto, pbarplot, pboxplot, nrow=1, ncol=3)
ggsave(filename="my3ggplots.png", plot=my3ggplots, width=12)
```
</details>