---
output:
  html_document:
    includes:
      in_header: analytics.html
    css: styles.css
    code_folding: show
    toc: TRUE
    toc_float: TRUE
    pandoc_args:
      "--tab-stop=2"
editor_options:
  chunk_output_type: console
---
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato" />
::: {#header}
<img src="graphics-guide/www/images/urban-institute-logo.png" width="350"/>
:::
# Urban Institute R Graphics Guide
```{r setup, include=FALSE}
library(knitr)
library(datasets)
library(tidyverse)
library(urbnthemes)
set_urbn_defaults(style = "print")
opts_chunk$set(fig.path = "graphics-guide/www/images/")
opts_chunk$set(echo = TRUE)
opts_chunk$set(warning = FALSE)
opts_chunk$set(message = FALSE)
opts_chunk$set(fig.width = 6.5)
opts_chunk$set(fig.height = 4)
opts_chunk$set(fig.retina = 3)
options(scipen = 999)
```
R is a powerful, open-source programming language and environment. R excels at data management and munging, traditional statistical analysis, machine learning, and reproducible research, but it is probably best known for its graphics. This guide contains examples and instructions for popular and lesser-known plotting techniques in R. It also includes instructions for using `urbnthemes`, the Urban Institute's R package for creating near-publication-ready plots with `ggplot2`. If you have any questions, please don't hesitate to contact Aaron Williams (awilliams\@urban.org) or Kyle Ueyama (kueyama\@urban.org).
### Background
`library(urbnthemes)` makes `ggplot2` output align more closely with [the Urban Institute's Data Visualization style guide](http://urbaninstitute.github.io/graphics-styleguide/). This package does **not produce publication ready graphics**. Visual styles must still be edited using your project/paper's normal editing workflow.
Exporting charts as PDFs makes them easier to edit. See the Saving Plots section for more information.
The theme has been tested against `ggplot2` version 3.0.0. It will not function properly with older versions of `ggplot2`.
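The version requirement above can be checked before loading the theme. The following is a minimal sketch; `check_ggplot2_version()` is a hypothetical helper written for this illustration and is not part of `urbnthemes`.

```r
# Hypothetical helper (not part of urbnthemes): compare a version string
# against the minimum ggplot2 version the theme was tested with.
check_ggplot2_version <- function(installed, minimum = "3.0.0") {
  # package_version objects compare component by component,
  # so "3.10.0" is correctly treated as newer than "3.9.0"
  package_version(installed) >= package_version(minimum)
}

# Illustrative version strings; in practice pass the installed version:
# check_ggplot2_version(as.character(utils::packageVersion("ggplot2")))
check_ggplot2_version("2.2.1")
check_ggplot2_version("3.4.0")
```

If the check returns `FALSE`, updating `ggplot2` with `install.packages("ggplot2")` should resolve it.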
### Using library(urbnthemes)
Run the following code to install or update `urbnthemes`:
```
install.packages("remotes")
remotes::install_github("UrbanInstitute/urbnthemes")
```
Run the following code at the top of each script:
```
library(tidyverse)
library(urbnthemes)
set_urbn_defaults(style = "print")
```
### Installing Lato {#installing_lato}
Your Urban computer may not have the Lato font installed. If it is not installed, please install the free [Lato font from Google](https://www.google.com/fonts/specimen/Lato). Below are step by step instructions:
1) Download the [Lato font](https://www.google.com/fonts/specimen/Lato) (as a zip file).
2) Unzip the file on your computer.
3) For each `.ttf` file in the unzipped `Lato/` folder, double click the file and click `Install` (on Windows) or `Install Font` (on Mac).
4) Import and register Lato into R by running `urbnthemes::lato_import()` in the console once. Be patient as this may take a few minutes!
5) To confirm installation, run `urbnthemes::lato_test()`. If this is successful you're done and Lato will automatically be used when creating plots with `library(urbnthemes)`. You only need to install Lato once per computer.
Waffle charts with glyphs require fontawesome. `fontawesome_test()` and `fontawesome_install()` are the fontawesome versions of the above functions. Be sure to install fontawesome from [here](https://github.com/hrbrmstr/waffle/tree/master/inst/fonts) first.
### Grammar of Graphics and Conventions
Hadley Wickham's ggplot2 is based on Leland Wilkinson's [*The Grammar of Graphics*](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448) and Wickham's [*A Layered Grammar of Graphics*](http://vita.had.co.nz/papers/layered-grammar.html). The layered grammar of graphics is a structured way of thinking about the components of a plot, which then lend themselves to the simple structure of ggplot2.
- **Data** are what are visualized in a plot, and **mappings** are directions for how data are mapped in a plot in a way that can be perceived by humans.
- **Geoms** are representations of the actual data like points, lines, and bars.
- **Stats** are statistical transformations that represent summaries of the data like histograms.
- **Scales** map values in the data space to values in the aesthetic space. Scales draw legends and axes.
- **Coordinate Systems** describe how geoms are mapped to the plane of the graphic.
- **Facets** break the data into meaningful subsets like small multiples.
- **Themes** control the finer points of a plot such as fonts, font sizes, and background colors.
More information: [ggplot2: Elegant Graphics for Data Analysis](https://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403)
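As a rough illustration of how the components listed above appear in code, the sketch below builds one plot with a line for each component. The dataset and the choice of `theme_minimal()` (a stand-in for the Urban theme) are assumptions made for this example only.

```r
library(ggplot2)

# One line per component of the layered grammar of graphics.
p <- ggplot(data = mtcars,                          # data
            mapping = aes(x = wt, y = mpg)) +       # mappings
  geom_point() +                                    # geom
  stat_smooth(method = "lm", formula = y ~ x) +     # stat
  scale_x_continuous(limits = c(0, 6)) +            # scale
  coord_cartesian() +                               # coordinate system
  facet_wrap(facets = vars(am)) +                   # facets
  theme_minimal()                                   # theme (stand-in)
```

Most plots rely on defaults for several of these components, so they rarely need to be written out all at once.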
### Tips and Tricks
- `ggplot2` expects data to be in data frames or tibbles. It is preferable for the data frames to be "tidy" with each variable as a column, each observation as a row, and each observational unit as a separate table. `dplyr` and `tidyr` contain concise and effective tools for "tidying" data.
- R allows function arguments to be called explicitly by name and implicitly by position. The coding examples in this guide only contain named arguments for clarity.
- Graphics will sometimes render differently on different operating systems. This is because anti-aliasing is activated in R on Mac and Linux but not activated in R on Windows. This won't be an issue once graphics are saved.
- Continuous x-axes have ticks. Discrete x-axes do not have ticks. Use `remove_ticks()` to remove ticks.
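The tip above about named and positional arguments can be illustrated with base R's `seq()`. This is only a small sketch; the values are arbitrary.

```r
# Positional matching: arguments are assigned by their order.
by_position <- seq(0, 40, 10)

# Named matching: each argument is labeled, so order no longer matters.
by_name <- seq(from = 0, to = 40, by = 10)
reordered <- seq(by = 10, to = 40, from = 0)

# All three calls produce the same sequence.
identical(by_position, by_name)
identical(by_name, reordered)
```

Named arguments take more typing but make it clear what each value controls, which is why this guide prefers them.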
## Bar Plots
------------------------------------------------------------------------
### One Color
```{r barplots}
mtcars %>%
count(cyl) %>%
ggplot(mapping = aes(x = factor(cyl), y = n)) +
geom_col() +
geom_text(mapping = aes(label = n), vjust = -1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(x = "Cylinders",
y = NULL) +
remove_ticks() +
remove_axis()
```
### One Color (Rotated)
This example introduces `coord_flip()` and `remove_axis(axis = "x", flip = TRUE)`. `remove_axis()` is from `library(urbnthemes)` and creates a custom theme for rotated bar plots.
```{r barplot-rotated}
mtcars %>%
count(cyl) %>%
ggplot(mapping = aes(x = factor(cyl), y = n)) +
geom_col() +
geom_text(mapping = aes(label = n), hjust = -1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(x = "Cylinders",
y = NULL) +
coord_flip() +
remove_axis(axis = "x", flip = TRUE)
```
### Three Colors
This plot is identical to the first bar plot except that colors and a legend are added with `fill = cyl`. Converting `cyl` to a factor with `factor(cyl)` keeps the empty values 5 and 7 off the x-axis. Adding `fill = cyl` without `factor()` would have created a continuous color scheme and legend.
```{r 3-color-barplot}
mtcars %>%
mutate(cyl = factor(cyl)) %>%
count(cyl) %>%
ggplot(mapping = aes(x = cyl, y = n, fill = cyl)) +
geom_col() +
geom_text(mapping = aes(label = n), vjust = -1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(x = "Cylinders",
y = NULL) +
remove_ticks() +
remove_axis()
```
### Stacked Bar Plot
An additional aesthetic can easily be added to bar plots by adding `fill = categorical variable` to the mapping. Here, transmission type subsets each bar showing the count of cars with different numbers of cylinders.
```{r stacked-bar-plot}
mtcars %>%
mutate(am = factor(am, labels = c("Automatic", "Manual")),
cyl = factor(cyl)) %>%
group_by(am) %>%
count(cyl) %>%
group_by(cyl) %>%
arrange(desc(am)) %>%
mutate(label_height = cumsum(n)) %>%
ggplot() +
geom_col(mapping = aes(x = cyl, y = n, fill = am)) +
geom_text(aes(x = cyl, y = label_height - 0.5, label = n, color = am)) +
scale_color_manual(values = c("white", "black")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(x = "Cylinders",
y = NULL) +
remove_ticks() +
remove_axis() +
guides(color = "none")
```
### Stacked Bar Plot With Position = Fill
The previous examples used `geom_col()`, which takes a y value for bar height. This example uses `geom_bar()`, which counts the observations in each group and uses those counts as bar heights. In this example, `position = "fill"` in `geom_bar()` changes the y-axis from counts to the proportion of each bar.
```{r stacked-bar-plot-fill}
mtcars %>%
mutate(am = factor(am, labels = c("Automatic", "Manual")),
cyl = factor(cyl)) %>%
ggplot() +
geom_bar(mapping = aes(x = cyl, fill = am), position = "fill") +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)), labels = scales::percent) +
labs(x = "Cylinders",
y = NULL) +
remove_ticks() +
guides(color = "none")
```
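The difference between `geom_col()` and `geom_bar()` described above can be checked directly. The sketch below builds the same bar chart both ways and inspects the computed bar heights with `layer_data()`; the object names are made up for this example.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows for each x value itself.
p_bar <- ggplot(data = mtcars, mapping = aes(x = factor(cyl))) +
  geom_bar()

# geom_col() expects the heights to be computed beforehand.
cyl_counts <- count(x = mtcars, cyl)
p_col <- ggplot(data = cyl_counts, mapping = aes(x = factor(cyl), y = n)) +
  geom_col()

# Both plots draw bars of the same height.
layer_data(p_bar)$count
layer_data(p_col)$y
```

`geom_col()` is usually the better choice when labels or other layers need the counts, because the counts already exist as a column.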
### Dodged Bar Plot
Subsetted bar charts in ggplot2 are stacked by default. `position = "dodge"` in `geom_col()` expands the bar chart so the bars appear next to each other.
```{r dodged-bar-plot}
mtcars %>%
mutate(am = factor(am, labels = c("Automatic", "Manual")),
cyl = factor(cyl)) %>%
group_by(am) %>%
count(cyl) %>%
ggplot(mapping = aes(cyl, y = n, fill = factor(am))) +
geom_col(position = "dodge") +
geom_text(aes(label = n), position = position_dodge(width = 0.7), vjust = -1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(x = "Cylinders",
y = NULL) +
remove_ticks() +
remove_axis()
```
### Lollipop plot/Cleveland dot plot {.tabset}
Lollipop plots and Cleveland dot plots are minimalist alternatives to bar plots. The key to both plots is to order the data based on the continuous variable using `arrange()` and then turn the discrete variable into a factor with the ordered levels of the continuous variable using `mutate()`. This step "stores" the order of the data.
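The ordering step can be seen in isolation with a tiny example. The data below are hypothetical and deliberately out of order.

```r
library(dplyr)

# Hypothetical data, deliberately out of order.
cars_small <- data.frame(
  model = c("B", "C", "A"),
  mpg = c(30, 10, 20)
)

ordered_cars <- cars_small %>%
  arrange(mpg) %>%                                  # sort by the continuous variable
  mutate(model = factor(model, levels = model))     # store that order in the levels

# The factor levels now follow mpg, not the alphabet.
levels(ordered_cars$model)
```

Without the `mutate()` step, `ggplot2` would sort the models alphabetically on the axis.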
#### Lollipop plot
```{r lollipop-plot, fig.height = 5}
mtcars %>%
rownames_to_column("model") %>%
arrange(mpg) %>%
mutate(model = factor(model, levels = .$model)) %>%
ggplot(aes(mpg, model)) +
geom_segment(aes(x = 0, xend = mpg, y = model, yend = model)) +
geom_point() +
scale_x_continuous(expand = expansion(mult = c(0, 0)), limits = c(0, 40)) +
labs(x = "Miles Per Gallon",
y = NULL)
```
#### Cleveland dot plot
```{r cleveland-dot-plot, fig.height = 5}
mtcars %>%
rownames_to_column("model") %>%
arrange(mpg) %>%
mutate(model = factor(model, levels = .$model)) %>%
ggplot(aes(mpg, model)) +
geom_point() +
scale_x_continuous(expand = expansion(mult = c(0, 0)), limits = c(0, 40)) +
labs(x = "Miles Per Gallon",
y = NULL)
```
### Dumbbell plot
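This section has no example in the guide. One possible approach, offered only as a sketch, is to connect two points per category with `geom_segment()`. The unemployment rates are the ones used in the slope plot section of this guide; the object and column names are assumptions made for this illustration.

```r
library(ggplot2)

# Rates reused from the slope plot example, reshaped to one row per state.
unemployment_wide <- data.frame(
  state = c("Maryland", "Virginia", "Washington, D.C."),
  october_2009 = c(7.4, 7.1, 10.0),
  august_2017 = c(3.9, 3.8, 6.4)
)

dumbbell <- ggplot(data = unemployment_wide) +
  # the bar of the dumbbell
  geom_segment(mapping = aes(x = august_2017, xend = october_2009,
                             y = state, yend = state)) +
  # the two ends of the dumbbell
  geom_point(mapping = aes(x = october_2009, y = state)) +
  geom_point(mapping = aes(x = august_2017, y = state)) +
  labs(x = "Unemployment rate (percent)",
       y = NULL)
```

Mapping different colors to the two `geom_point()` layers would help readers tell the two dates apart.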
## Scatter Plots
------------------------------------------------------------------------
### One Color Scatter Plot
Scatter plots are useful for showing relationships between two or more variables. Use `scatter_grid()` from `library(urbnthemes)` to easily add vertical grid lines for scatter plots.
```{r one-color-scatter-plot}
mtcars %>%
ggplot(mapping = aes(x = wt, y = mpg)) +
geom_point() +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 6),
breaks = 0:6) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
limits = c(0, 40),
breaks = 0:8 * 5) +
labs(x = "Weight (thousands of pounds)",
y = "City MPG") +
scatter_grid()
```
### High-Density Scatter Plot with Transparency
Large numbers of observations can sometimes make scatter plots tough to interpret because points overlap. Adding `alpha =` with a number between 0 and 1 adds transparency to points and clarity to plots. Now it's easy to see that jewelry stores are probably rounding up but not rounding down carats!
```{r alpha-scatter-plot}
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point(alpha = 0.05) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 6),
breaks = 0:6) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
limits = c(0, 20000),
breaks = 0:4 * 5000,
labels = scales::dollar) +
labs(x = "Carat",
y = "Price") +
scatter_grid()
```
### Hex Scatter Plot
Sometimes transparency isn't enough to bring clarity to a scatter plot with many observations. As n increases into the hundreds of thousands and even millions, `geom_hex` can be one of the best ways to display relationships between two variables.
```{r scatter-plot-hex}
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_hex(mapping = aes(fill = after_stat(count))) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 6),
breaks = 0:6) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
limits = c(0, 20000),
breaks = 0:4 * 5000,
labels = scales::dollar) +
scale_fill_gradientn(labels = scales::comma) +
labs(x = "Carat",
y = "Price") +
scatter_grid() +
theme(legend.position = "right",
legend.direction = "vertical")
```
### Scatter Plots With Random Noise {.tabset}
Sometimes scatter plots have many overlapping points but a reasonable number of observations. `geom_jitter` adds a small amount of random noise so points are less likely to overlap. `width` and `height` control the amount of noise that is added. In the following before-and-after, notice how many more points are visible after adding jitter.
#### Before
```{r before-scatter-plot}
mpg %>%
ggplot(mapping = aes(x = displ, y = cty)) +
geom_point() +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 8),
breaks = 0:8) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
limits = c(0, 40),
breaks = 0:4 * 10) +
labs(x = "Displacement",
y = "City MPG") +
scatter_grid()
```
#### After
```{r jitter-plot}
set.seed(2017)
mpg %>%
ggplot(mapping = aes(x = displ, y = cty)) +
geom_jitter() +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 8),
breaks = 0:8) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
limits = c(0, 40),
breaks = 0:4 * 10) +
labs(x = "Displacement",
y = "City MPG") +
scatter_grid()
```
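The example above uses the default amount of noise. The sketch below sets `width` and `height` explicitly and checks how far the points moved; the specific values 0.1 and 0.25 are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(2017)  # jitter is random; a seed makes the result reproducible

jittered <- ggplot(data = mpg, mapping = aes(x = displ, y = cty)) +
  geom_jitter(width = 0.1, height = 0.25, alpha = 0.5)

# Noise is added in both directions, so each point moves at most
# `width` horizontally and `height` vertically.
built <- layer_data(jittered)
max(abs(built$x - mpg$displ))
max(abs(built$y - mpg$cty))
```

Keep the noise small relative to the spacing of the data, or the jittered plot can misrepresent the underlying values.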
### Scatter Plots with Varying Point Size
Weights and populations can be mapped in scatter plots to the size of the points. Here, the number of households in each state is mapped to the size of each point using `aes(size = hhpop)`. Note: `ggplot2::geom_point()` is used instead of `geom_point()`.
```{r geom_point-size, fig.height = 5}
urbnmapr::statedata %>%
ggplot(mapping = aes(x = medhhincome, y = horate)) +
ggplot2::geom_point(mapping = aes(size = hhpop), alpha = 0.3) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(30000, 80000),
breaks = 3:8 * 10000,
labels = scales::dollar) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
limits = c(0, 0.8),
breaks = 0:4 * 0.2) +
scale_radius(range = c(3, 15),
breaks = c(2500000, 7500000, 12500000),
labels = scales::comma) +
labs(x = "Household income",
y = "Homeownership rate") +
scatter_grid() +
theme(plot.margin = margin(r = 20))
```
### Scatter Plots with Fill
A third aesthetic can be added to scatter plots. Here, color signifies the number of cylinders in each car. Before `ggplot()` is called, `cyl` is turned into a descriptive label (for example, "4 cylinders") using `mutate()` from `library(dplyr)` and the piping operator `%>%`.
```{r filled-scatter-plot}
mtcars %>%
mutate(cyl = paste(cyl, "cylinders")) %>%
ggplot(aes(x = wt, y = mpg, color = cyl)) +
geom_point() +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 6),
breaks = 0:6) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
limits = c(0, 40),
breaks = 0:8 * 5) +
labs(x = "Weight (thousands of pounds)",
y = "City MPG") +
scatter_grid()
```
## Line Plots
------------------------------------------------------------------------
```{r line-plots}
economics %>%
ggplot(mapping = aes(x = date, y = unemploy)) +
geom_line() +
scale_x_date(expand = expansion(mult = c(0.002, 0)),
breaks = "10 years",
limits = c(as.Date("1961-01-01"), as.Date("2020-01-01")),
date_labels = "%Y") +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
breaks = 0:4 * 4000,
limits = c(0, 16000),
labels = scales::comma) +
labs(x = "Year",
y = "Number Unemployed (1,000s)")
```
### Line Plots With Multiple Lines
```{r multiple-line-charts1}
library(gapminder)
gapminder %>%
filter(country %in% c("Australia", "Canada", "New Zealand")) %>%
mutate(country = factor(country, levels = c("Canada", "Australia", "New Zealand"))) %>%
ggplot(aes(year, gdpPercap, color = country)) +
geom_line() +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
breaks = c(1952 + 0:12 * 5),
limits = c(1952, 2007)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
breaks = 0:8 * 5000,
labels = scales::dollar,
limits = c(0, 40000)) +
labs(x = "Year",
y = "Per capita GDP (US dollars)")
```
Plotting more than one variable can be useful for seeing the relationship of variables over time, but it takes a small amount of data munging.
This is because `ggplot2` wants data in a "long" format instead of a "wide" format for line plots with multiple lines. `gather()` and `spread()` from the `tidyr` package make switching back-and-forth between "long" and "wide" painless. Essentially, variable names go into "key" and variable values go into "value". Then ggplot2 turns the different levels of the key variable (here, the stock indices DAX, SMI, CAC, and FTSE) into colors.
```{r multiple-line-charts2}
as_tibble(EuStockMarkets) %>%
mutate(date = time(EuStockMarkets)) %>%
gather(key = "key", value = "value", -date) %>%
ggplot(mapping = aes(x = date, y = value, color = key)) +
geom_line() +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(1991, 1999),
breaks = c(1991, 1993, 1995, 1997, 1999)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
breaks = 0:4 * 2500,
labels = scales::dollar,
limits = c(0, 10000)) +
labs(x = "Date",
y = "Value")
```
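In recent `tidyr` releases, `gather()` is superseded by `pivot_longer()`. The sketch below shows the same wide-to-long reshaping on a small slice of `EuStockMarkets`; the slice (three rows, two indices) is chosen only to keep the output short.

```r
library(tidyr)

# Three rows and two indices from the wide EuStockMarkets matrix.
wide <- as.data.frame(EuStockMarkets[1:3, c("DAX", "FTSE")])
wide$date <- as.numeric(time(EuStockMarkets))[1:3]

# Column names go into "key" and cell values go into "value".
long <- pivot_longer(data = wide,
                     cols = -date,
                     names_to = "key",
                     values_to = "value")
long
```

Each date now appears once per index, which is the shape `ggplot2` needs to draw one line per level of `key`.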
### Step plot
`geom_line()` connects coordinates with the shortest possible straight line. Sometimes step plots are necessary because y values don't change between coordinates. For example, the upper-bound of the Federal Funds Rate is set at regular intervals and remains constant until it is changed.
```{r step-plot}
# downloaded from FRED on 2018-12-06
# https://fred.stlouisfed.org/series/DFEDTARU
fed_fund_rate <- read_csv(
"date, fed_funds_rate
2014-01-01,0.0025
2015-12-16,0.0050
2016-12-14,0.0075
2017-03-16,0.0100
2017-06-15,0.0125
2017-12-14,0.0150
2018-03-22,0.0175
2018-06-14,0.0200
2018-09-27,0.0225
2018-12-06,0.0225")
fed_fund_rate %>%
ggplot(mapping = aes(x = date, y = fed_funds_rate)) +
geom_step() +
scale_x_date(expand = expansion(mult = c(0.002, 0)),
breaks = "1 year",
limits = c(as.Date("2014-01-01"), as.Date("2019-01-01")),
date_labels = "%Y") +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
breaks = c(0, 0.01, 0.02, 0.03),
limits = c(0, 0.03),
labels = scales::percent) +
labs(x = "Date",
y = "Upper-bound of the Federal Funds Rate")
```
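`geom_step()` also has a `direction` argument that controls the shape of each step. The sketch below uses three rows from the data above and spells out the default, `direction = "hv"`, which moves horizontally first and then vertically.

```r
library(ggplot2)

# Three rate changes taken from the table above.
rates <- data.frame(
  date = as.Date(c("2015-12-16", "2016-12-14", "2017-03-16")),
  rate = c(0.0050, 0.0075, 0.0100)
)

# "hv" holds each value flat until the next change, then jumps.
# "vh" would jump first and then move horizontally.
step_plot <- ggplot(data = rates, mapping = aes(x = date, y = rate)) +
  geom_step(direction = "hv")
```

For a rate that stays constant until it is changed, `"hv"` is the direction that matches the data.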
### Path plot
The Beveridge curve is a macroeconomic plot that displays a relationship between the unemployment rate and the vacancy rate. Movements along the curve indicate changes in the business cycle and horizontal shifts of the curve suggest structural changes in the labor market.
Lines in Beveridge curves do not monotonically move from left to right. Therefore, it is necessary to use `geom_path()`.
```{r, path-plot}
# seasonally-adjusted, quarterly vacancy rate - JOLTS
# seasonally-adjusted, quarterly unemployment rate - CPS
# pulled from FRED on April 11, 2018.
library(ggrepel)
beveridge <- read_csv(
"quarter, vacanacy_rate, unempoyment_rate
2006-01-01,0.0310,0.0473
2006-04-01,0.0316,0.0463
2006-07-01,0.0313,0.0463
2006-10-01,0.0310,0.0443
2007-01-01,0.0323,0.0450
2007-04-01,0.0326,0.0450
2007-07-01,0.0316,0.0466
2007-10-01,0.0293,0.0480
2008-01-01,0.0286,0.0500
2008-04-01,0.0280,0.0533
2008-07-01,0.0253,0.0600
2008-10-01,0.0220,0.0686
2009-01-01,0.0196,0.0826
2009-04-01,0.0180,0.0930
2009-07-01,0.0176,0.0963
2009-10-01,0.0180,0.0993
2010-01-01,0.0196,0.0983
2010-04-01,0.0220,0.0963
2010-07-01,0.0216,0.0946
2010-10-01,0.0220,0.0950
2011-01-01,0.0226,0.0903
2011-04-01,0.0236,0.0906
2011-07-01,0.0250,0.0900
2011-10-01,0.0243,0.0863
2012-01-01,0.0270,0.0826
2012-04-01,0.0270,0.0820
2012-07-01,0.0266,0.0803
2012-10-01,0.0260,0.0780
2013-01-01,0.0276,0.0773
2013-04-01,0.0280,0.0753
2013-07-01,0.0280,0.0723
2013-10-01,0.0276,0.0693
2014-01-01,0.0290,0.0666
2014-04-01,0.0323,0.0623
2014-07-01,0.0326,0.0610
2014-10-01,0.0330,0.0570
2015-01-01,0.0350,0.0556
2015-04-01,0.0366,0.0540
2015-07-01,0.0373,0.0510
2015-10-01,0.0360,0.0500
2016-01-01,0.0386,0.0493
2016-04-01,0.0383,0.0486
2016-07-01,0.0383,0.0493
2016-10-01,0.0363,0.0473
2017-01-01,0.0366,0.0466
2017-04-01,0.0390,0.0433
2017-07-01,0.0406,0.0430
2017-10-01,0.0386,0.0410")
labels <- beveridge %>%
filter(lubridate::month(quarter) == 1)
beveridge %>%
ggplot() +
geom_path(mapping = aes(x = unempoyment_rate, y = vacanacy_rate), alpha = 0.5) +
geom_point(data = labels, mapping = aes(x = unempoyment_rate, y = vacanacy_rate)) +
geom_text_repel(data = labels, mapping = aes(x = unempoyment_rate, y = vacanacy_rate, label = lubridate::year(quarter))) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0.04, 0.1),
labels = scales::percent) +
scale_y_continuous(expand = expansion(mult = c(0, 0.002)),
breaks = c(0, 0.01, 0.02, 0.03, 0.04, 0.05),
limits = c(0, 0.05),
labels = scales::percent) +
labs(x = "Seasonally-adjusted unemployment rate",
y = "Seasonally-adjusted vacancy rate") +
scatter_grid()
```
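The reason `geom_path()` is needed can be seen with a tiny example: `geom_line()` connects observations in order of the x variable, while `geom_path()` connects them in the order they appear in the data. The three points below are hypothetical.

```r
library(ggplot2)

# Hypothetical points that double back along the x-axis.
loop <- data.frame(x = c(1, 3, 2), y = c(1, 2, 3))

line_plot <- ggplot(data = loop, mapping = aes(x = x, y = y)) +
  geom_line()
path_plot <- ggplot(data = loop, mapping = aes(x = x, y = y)) +
  geom_path()

# geom_line() sorts by x before drawing; geom_path() keeps row order.
layer_data(line_plot)$x
layer_data(path_plot)$x
```

Because `geom_path()` follows row order, the data should be sorted by time before plotting a curve like the one above.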
### Slope plots
```{r slope-plot, fig.height = 5}
# https://www.bls.gov/lau/
library(ggrepel)
unemployment <- tibble(
time = c("October 2009", "October 2009", "October 2009", "August 2017", "August 2017", "August 2017"),
rate = c(7.4, 7.1, 10.0, 3.9, 3.8, 6.4),
state = c("Maryland", "Virginia", "Washington, D.C.", "Maryland", "Virginia", "Washington, D.C.")
)
label <- tibble(label = c("October 2009", "August 2017"))
october <- filter(unemployment, time == "October 2009")
august <- filter(unemployment, time == "August 2017")
unemployment %>%
mutate(time = factor(time, levels = c("October 2009", "August 2017")),
state = factor(state, levels = c("Washington, D.C.", "Maryland", "Virginia"))) %>%
ggplot() +
geom_line(aes(time, rate, group = state, color = state), show.legend = FALSE) +
geom_point(aes(x = time, y = rate, color = state)) +
labs(subtitle = "Unemployment Rate") +
theme(axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank(),
axis.line = element_blank()) +
geom_text_repel(data = october, mapping = aes(x = time, y = rate, label = as.character(rate)), nudge_x = -0.06) +
geom_text_repel(data = august, mapping = aes(x = time, y = rate, label = as.character(rate)), nudge_x = 0.06)
```
## Univariate
------------------------------------------------------------------------
There are a number of ways to explore the distributions of univariate data in R. Some methods, like strip charts, show all data points. Other methods, like the box and whisker plot, show selected data points that communicate key values like the median and 25th percentile. Finally, some methods don't show any of the underlying data but calculate density estimates. Each method has advantages and disadvantages, so it is worthwhile to understand the different forms. For more information, read [40 years of boxplots](http://vita.had.co.nz/papers/boxplots.pdf) by Hadley Wickham and Lisa Stryjewski.
### Strip Chart
Strip charts, the simplest univariate plot, show the distribution of values along one axis. Strip charts work best with variables that have plenty of variation. If not, the points tend to cluster on top of each other. Even if the variable has plenty of variation, it is often important to add transparency to the points with `alpha =` so overlapping values are visible.
```{r stripchart, fig.height=2}
msleep %>%
ggplot(aes(x = sleep_total, y = factor(1))) +
geom_point(alpha = 0.2, size = 5) +
labs(y = NULL) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 25),
breaks = 0:5 * 5) +
scale_y_discrete(labels = NULL) +
labs(title = "Total Sleep Time of Different Mammals",
x = "Total sleep time (hours)",
y = NULL) +
theme(axis.ticks.y = element_blank())
```
### Strip Chart with Highlighting
Because strip charts show all values, they are useful for showing where selected points lie in the distribution of a variable. The clearest way to do this is by adding `geom_point()` twice with `filter()` in the data argument. This way, the highlighted values show up on top of unhighlighted values.
```{r stripchart-with-highlighting, fig.height=2}
ggplot() +
geom_point(data = filter(msleep, name != "Red fox"),
aes(x = sleep_total,
y = factor(1)),
alpha = 0.2,
size = 5,
color = "grey50") +
geom_point(data = filter(msleep, name == "Red fox"),
aes(x = sleep_total,
y = factor(1),
color = name),
alpha = 0.8,
size = 5) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 25),
breaks = 0:5 * 5) +
scale_y_discrete(labels = NULL) +
labs(title = "Total Sleep Time of Different Mammals",
x = "Total sleep time (hours)",
y = NULL) +
guides(color = guide_legend(title = NULL)) +
theme(axis.ticks.y = element_blank())
```
### Subsetted Strip Chart
Add a y variable to see the distributions of the continuous variable in subsets of a categorical variable.
```{r subsetted-stripchart, fig.height=3}
library(forcats)
msleep %>%
filter(!is.na(vore)) %>%
mutate(vore = fct_recode(vore,
"Insectivore" = "insecti",
"Omnivore" = "omni",
"Herbivore" = "herbi",
"Carnivore" = "carni"
)) %>%
ggplot(aes(x = sleep_total, y = vore)) +
geom_point(alpha = 0.2, size = 5) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 25),
breaks = 0:5 * 5) +
labs(title = "Total Sleep Time of Different Mammals by Diet",
x = "Total sleep time (hours)",
y = NULL) +
theme(axis.ticks.y = element_blank())
```
### Beeswarm Plots
Beeswarm plots are a variation of strip charts that show the distribution of data without overlapping points.
```{r beeswarm}
library(ggbeeswarm)
txhousing %>%
filter(city %in% c("Austin","Houston","Dallas","San Antonio","Fort Worth")) %>%
ggplot(aes(x = median, y = city)) +
geom_beeswarm(alpha = 0.2, size = 5) +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Household Sale Price by City",
x = "Sale Price",
y = NULL) +
theme(axis.ticks.y = element_blank())
```
### Histograms
Histograms divide the distribution of a variable into n equal-sized bins and then count and display the number of observations in each bin. Histograms are sensitive to bin width. As `?geom_histogram` notes, "You should always override \[the default binwidth\] value, exploring multiple widths to find the best to illustrate the stories in your data."
```{r histogram}
ggplot(data = diamonds, mapping = aes(x = depth)) +
geom_histogram(bins = 100) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, 100)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2)), labels = scales::comma) +
labs(x = "Depth",
y = "Count")
```
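To explore several bin widths as the quoted advice suggests, the same plot can be rebuilt with different `binwidth` values. The three widths below are arbitrary starting points, not recommendations.

```r
library(ggplot2)

base_plot <- ggplot(data = diamonds, mapping = aes(x = depth))

# Same data, three bin widths.
narrow <- base_plot + geom_histogram(binwidth = 0.1)
medium <- base_plot + geom_histogram(binwidth = 0.5)
wide <- base_plot + geom_histogram(binwidth = 2)

# Wider bins produce fewer bars.
nrow(layer_data(narrow))
nrow(layer_data(medium))
nrow(layer_data(wide))
```

Printing `narrow`, `medium`, and `wide` one after another shows how much of the apparent shape depends on the bin width.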
### Boxplots
Boxplots were invented in the 1970s by John Tukey[^1]. Instead of showing the underlying data or binned counts of the underlying data, they focus on important values like the 25th percentile, median, and 75th percentile.
[^1]: Wickham, H., & Stryjewski, L. (2011). 40 years of boxplots.
```{r box-plot}
InsectSprays %>%
ggplot(mapping = aes(x = spray, y = count)) +
geom_boxplot() +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
labs(x = "Type of insect spray",
y = "Number of dead insects") +
remove_ticks()
```
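The values a boxplot draws can be computed by hand. The sketch below does this in base R for spray "A"; the variable names are made up for this example.

```r
# Counts for spray "A" from the built-in InsectSprays data.
spray_a <- InsectSprays$count[InsectSprays$spray == "A"]

# The box spans the 25th to 75th percentiles; the line inside is the median.
box_values <- quantile(spray_a, probs = c(0.25, 0.5, 0.75))
box_values

# Whiskers reach the most extreme observations that lie within
# 1.5 times the interquartile range of the box.
upper_limit <- box_values[3] + 1.5 * IQR(spray_a)
lower_limit <- box_values[1] - 1.5 * IQR(spray_a)
upper_whisker <- max(spray_a[spray_a <= upper_limit])
lower_whisker <- min(spray_a[spray_a >= lower_limit])
```

Observations beyond the whiskers are the ones `geom_boxplot()` draws as individual points.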
### Smoothed Kernel Density Plots
Continuous variables with smooth distributions are sometimes better represented with smoothed kernel density estimates than histograms or boxplots. `geom_density()` computes and plots a kernel density estimate. Notice the lumps around integers and halves in the following distribution because of rounding.
```{r kernel-density-plot}
diamonds %>%
ggplot(mapping = aes(carat)) +
geom_density(color = NA) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, NA)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
labs(x = "Carat",
y = "Density")
```
```{r kernel-density-plot-filled}
diamonds %>%
mutate(cost = ifelse(price > 5500, "More than $5,500", "$0 to $5,500")) %>%
ggplot(mapping = aes(carat, fill = cost)) +
geom_density(alpha = 0.25, color = NA) +
scale_x_continuous(expand = expansion(mult = c(0.002, 0)),
limits = c(0, NA)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(x = "Carat",
y = "Density")
```
### Ridgeline Plots
Ridgeline plots are partially overlapping smoothed kernel density plots, one for each level of a categorical variable, that pack a lot of information into one elegant plot.
```{r ridgeline-plots}
library(ggridges)
ggplot(diamonds, mapping = aes(x = price, y = cut)) +
geom_density_ridges(fill = "#1696d2") +
labs(x = "Price",
y = "Cut")
```
### Violin Plots
Violin plots are symmetrical displays of smooth kernel density plots.
```{r violin-plot}
InsectSprays %>%
ggplot(mapping = aes(x = spray, y = count, fill = spray)) +
geom_violin(color = NA) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
labs(x = "Type of insect spray",
y = "Number of insects") +
remove_ticks()
```
### Bean Plot
Individual outliers and important summary values are not visible in violin plots or smoothed kernel density plots. Bean plots, [created by Peter Kampstra in 2008](https://www.jstatsoft.org/article/view/v028c01), are violin plots with data shown as small lines in a one-dimensional strip plot and larger lines for the mean.
```{r beanplot}
msleep %>%
filter(!is.na(vore)) %>%
mutate(vore = fct_recode(vore,
"Insectivore" = "insecti",
"Omnivore" = "omni",
"Herbivore" = "herbi",
"Carnivore" = "carni"
)) %>%
ggplot(aes(x = vore, y = sleep_total, fill = vore)) +
stat_summary(fun = "mean",
colour = "black",
size = 30,
shape = 95,
geom = "point") +
geom_violin(color = NA) +
geom_jitter(width = 0,
height = 0.05,
alpha = 0.4,
shape = "-",
size = 10,
color = "grey50") +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
labs(x = NULL,
y = "Total sleep time (hours)") +
theme(legend.position = "none") +
remove_ticks()
```
## Area Plot
------------------------------------------------------------------------
### Stacked Area
```{r area-plot-stack}
txhousing %>%
filter(city %in% c("Austin","Houston","Dallas","San Antonio","Fort Worth")) %>%
group_by(city, year) %>%
summarize(sales = sum(sales)) %>%
ggplot(aes(x = year, y = sales, fill = city)) +
geom_area(position = "stack") +
scale_x_continuous(expand = expansion(mult = c(0, 0)),
limits = c(2000, 2015),
breaks = 2000 + 0:15) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2)),
labels = scales::comma) +
labs(x = "Year",
y = "Home sales")
```
### Filled Area
```{r area-plot-fill}
txhousing %>%
filter(city %in% c("Austin","Houston","Dallas","San Antonio","Fort Worth")) %>%
group_by(city, year) %>%
summarize(sales = sum(sales)) %>%
ggplot(aes(x = year, y = sales, fill = city)) +
geom_area(position = "fill") +
scale_x_continuous(expand = expansion(mult = c(0, 0)),
limits = c(2000, 2015),
breaks = 2000 + 0:15) +
scale_y_continuous(expand = expansion(mult = c(0, 0.02)),
breaks = c(0, 0.25, 0.5, 0.75, 1),
labels = scales::percent) +
labs(x = "Year",
y = "Share of home sales")
```
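With `position = "fill"`, each year's stack is rescaled so that it sums to 1, which means the y-axis shows shares rather than counts. A rough sketch of the same calculation done by hand follows. `city_shares` and `share` are names introduced here, and `na.rm = TRUE` is an added assumption to guard against missing months.

```{r area-plot-fill-shares}
library(dplyr)
library(ggplot2)

city_shares <- txhousing %>%
  filter(city %in% c("Austin", "Houston", "Dallas", "San Antonio", "Fort Worth")) %>%
  group_by(city, year) %>%
  summarize(sales = sum(sales, na.rm = TRUE), .groups = "drop") %>%
  # within each year, divide each city's sales by the five-city total
  group_by(year) %>%
  mutate(share = sales / sum(sales)) %>%
  ungroup()

city_shares
```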
## Sankey Plot
------------------------------------------------------------------------
Sankey plots visualize flows from one set of variables to another. This can be useful for showing outcomes from the start of a program to the end. You'll need to install the `ggsankey` package from GitHub to create Sankey plots in R. In this example I make a dummy data set of housing status prior to program start and at exit to show the flow of people between outcomes. A key step is to transform your data set using the `make_long()` function from the package. This creates a data frame that specifies each of the initial nodes and how they flow into the next stage.
```{r}
# install ggsankey from GitHub once (not on every render):
# remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey)

# create a dummy dataset of housing status
df <- tibble(
  entry_status = c(rep("Housed", 7), rep("Unhoused", 15), rep("Staying w/ Family", 8)),
  exit_status = c(rep("Housed", 15), rep("Unhoused", 2), rep("Staying w/ Family", 13))
) %>%
  # transform the data frame into the proper format for the sankey plot
  make_long(entry_status, exit_status) %>%
  # recode the labels to be cleaner in the plot
  mutate(
    x = recode(x, entry_status = "Prior Housing Status", exit_status = "Exit Housing Status"),
    next_x = recode(next_x, entry_status = "Prior Housing Status", exit_status = "Exit Housing Status")
  )

# create sankey plot
ggplot(df, aes(x = x,
               next_x = next_x,
               node = node,
               next_node = next_node,
               fill = factor(node),
               label = node)) +
  geom_sankey(flow.alpha = 0.5, node.color = 1, show.legend = FALSE) +
  # add labels to plot and style
  geom_sankey_label(size = 3.5, color = 1, fill = "white") +
  theme_sankey(base_size = 16) +
  labs(x = NULL)
```
```
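To see which flows the plot above draws, it can help to count the entry-to-exit pairs and to build the long format by hand. The sketch below uses only `dplyr` and does not call `ggsankey`. `status`, `flows`, and `long_sketch` are names introduced here, and `long_sketch` is only an approximation of the long format, built around the `x`, `node`, `next_x`, and `next_node` columns that the plotting code maps to aesthetics.

```{r sankey-long-format-sketch}
library(dplyr)

# same dummy data as above, kept in wide format
status <- tibble(
  entry_status = c(rep("Housed", 7), rep("Unhoused", 15), rep("Staying w/ Family", 8)),
  exit_status = c(rep("Housed", 15), rep("Unhoused", 2), rep("Staying w/ Family", 13))
)

# each distinct entry -> exit pair becomes one ribbon in the sankey plot
flows <- count(status, entry_status, exit_status)
flows

# hand-built approximation of the long format: one row per person per stage
long_sketch <- bind_rows(
  transmute(status,
            x = "entry_status",
            node = entry_status,
            next_x = "exit_status",
            next_node = exit_status),
  transmute(status,
            x = "exit_status",
            node = exit_status,
            next_x = NA_character_,
            next_node = NA_character_)
)

long_sketch
```

Rows for the final stage have no next stage, so `next_x` and `next_node` are missing there.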
## Heat Map
------------------------------------------------------------------------
```{r heat-map}
library(fivethirtyeight)
bad_drivers %>%
filter(state %in% c("Maine", "New Hampshire", "Vermont", "Massachusetts", "Connecticut", "New York")) %>%
mutate(`Number of\nDrivers` = scale(num_drivers),
`Percent\nSpeeding` = scale(perc_speeding),
`Percent\nAlcohol` = scale(perc_alcohol),
`Percent Not\nDistracted` = scale(perc_not_distracted),
`Percent No\nPrevious` = scale(perc_no_previous),
state = factor(state, levels = rev(state))
) %>%
select(-(num_drivers:losses)) %>%
gather(`Number of\nDrivers`:`Percent No\nPrevious`, key = "variable", value = "SD's from Mean") %>%
ggplot(aes(variable, state)) +
geom_tile(aes(fill = `SD's from Mean`)) +
labs(x = NULL,
y = NULL) +
scale_fill_gradientn() +
theme(legend.position = "right",
legend.direction = "vertical",
axis.line.x = element_blank(),
panel.grid.major.y = element_blank()) +
remove_ticks()
#https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
```
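The heat map above standardizes each variable with `scale()` so that columns measured in different units share one color scale. A small sketch of what `scale()` computes, using made-up values (`x` and `z` are hypothetical and for illustration only):

```{r heat-map-standardize}
# hypothetical values, for illustration only
x <- c(10, 12, 15, 19, 24)

# scale() returns a one-column matrix, so convert it to a plain vector
z <- as.numeric(scale(x))

# scale() centers on the mean and divides by the sample standard deviation
all.equal(z, (x - mean(x)) / sd(x))
```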
## Faceting and Small Multiples
------------------------------------------------------------------------
### facet_wrap()
ggplot2's faceting system is a powerful way to make "small multiples".
Some edits to the theme may be necessary depending upon how many rows and columns are in the plot.
```{r small-multiples, fig.height=2}
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point(alpha = 0.05) +
facet_wrap(~cut, ncol = 5) +
scale_x_continuous(expand = expansion(mult = c(0, 0)),
limits = c(0, 6)) +
scale_y_continuous(expand = expansion(mult = c(0, 0)),
limits = c(0, 20000),
labels = scales::dollar) +
labs(x = "Carat",
y = "Price") +
scatter_grid()
```
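As a rough sketch of the kind of theme edits that a different layout might call for, the example below wraps the same panels into two rows and adjusts panel spacing and strip text. The specific values are illustrative assumptions, not house style, and `facet_sketch` is a name introduced here.

```{r small-multiples-theme}
library(ggplot2)

facet_sketch <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  # two rows instead of one row of five panels
  facet_wrap(~cut, nrow = 2) +
  labs(x = "Carat",
       y = "Price") +
  # more room between panels and smaller strip labels (illustrative values)
  theme(panel.spacing = unit(12, "pt"),
        strip.text = element_text(size = 9))

facet_sketch
```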
### facet_grid()
```{r faceting, fig.height=7}
diamonds %>%
filter(color %in% c("D", "E", "F", "G")) %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point(alpha = 0.05) +
facet_grid(color ~ cut) +
scale_x_continuous(expand = expansion(mult = c(0, 0)),
limits = c(0, 4)) +
scale_y_continuous(expand = expansion(mult = c(0, 0)),
limits = c(0, 20000),
labels = scales::dollar) +
labs(x = "Carat",
y = "Price") +
theme(panel.spacing = unit(20L, "pt")) +
scatter_grid()
```
## Smoothers
------------------------------------------------------------------------
`geom_smooth()` fits and plots models to data with two or more dimensions.
Understanding and manipulating defaults is more important for `geom_smooth()` than for other geoms because it makes a number of assumptions. By default, `geom_smooth()` uses loess for datasets with fewer than 1,000 observations and a generalized additive model with `formula = y ~ s(x, bs = "cs")` for datasets with 1,000 or more observations. Both default to displaying a 95% confidence interval.
Models are chosen with `method =`, which can be set to `lm`, `glm`, `gam`, `loess`, `MASS::rlm`, and more. Formulas are specified with `formula =` and `y ~ x` syntax. Plotting the confidence band is toggled with `se = TRUE` and `se = FALSE`, and the confidence level is specified with `level =`. As always, more information can be seen in RStudio with `?geom_smooth`.
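To make those arguments concrete, the sketch below fits a linear model by hand and then asks `geom_smooth()` for the same model. It is a minimal illustration under stated assumptions: `fit`, `grid`, and `band` are names introduced here, and the hand-computed band is meant to approximate what `geom_smooth(method = "lm")` draws, not to reproduce its internals.

```{r smoother-by-hand}
library(ggplot2)

# fit the model that method = "lm" with formula = y ~ x corresponds to
fit <- lm(price ~ carat, data = diamonds)

# evaluate the fit and its 95% confidence interval on an evenly spaced grid
grid <- data.frame(carat = seq(min(diamonds$carat), max(diamonds$carat), length.out = 80))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
head(cbind(grid, band))

# let geom_smooth() fit and draw the same model with explicit arguments
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95) +
  labs(x = "Carat",
       y = "Price")
```

Setting `method`, `formula`, `se`, and `level` explicitly documents the model in the code instead of leaving it to the defaults.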