See notes.R to follow along in RStudio.
Also see the R input from the lecture.
The Wednesday 1:10 pm discussion will be in Surge III 1283.
The Wednesday 2-4 pm office hours will be in Kerr 165.
The Friday 1-3 pm office hours will be in SSH 1246.
R ignores lines starting with the # symbol. These are called comments. Use comments to plan out what you're trying to do, and to explain tricky pieces of code.
Last time we looked some functions for changing R's working directory:
- getwd(), setwd(), list.files()
Then we loaded an RDS file with readRDS()
and looked at some functions for
inspecting the structure of a data set:
- head(), tail()
- nrow(), ncol(), dim()
- str()
- typeof(), class()
- rownames(), colnames(), names()
We also saw some functions for inspecting the content of a data set:
- $ operator, to extract a column
- table()
- sort()
Let's start with a few more of these.
x = readRDS("data/sts98.rds")
This data set has everyone's class level, major, and department. We can get
a quick summary of the data with the summary()
function:
summary(x)
This data set only has categorical information, so we just see counts. Last
time we saw that the table()
function computes counts, too. Unlike
summary()
, the count for every category is computed:
table(x$Major)
sort(table(x$Major))
You can also use table()
to count how many observations fall into pairs of
categories, across two columns. This is called cross-tabulation. For
instance, we can cross-tabulate class level and major:
table(x$Class, x$Major)
This gives us a very detailed breakdown of the data. Like any other value in
R, we can save this to a variable. The addmargins()
function adds row and
column sums to the table:
ctab = table(x$Major, x$Class)
addmargins(ctab)
It's a good idea to use variables to save intermediate steps in your code so that if something goes wrong, you can figure out where the problem is.
Q: What happens if we sort the cross-tabulated values?
sort(table(x$Class, x$Major))
This is probably not what we want. The counts in the table get sorted from
smallest to largest, but the row and column names are blown away! The
sort()
function only works well on vectors. Later on we'll learn how to
sort a table.
Here's an example of addmargins()
for a smaller part of the data:
# Get first 10 rows.
y = head(x, 10)
# Remove categories that aren't present in the first 10 rows; otherwise R
# remembers them.
y$Class = factor(y$Class)
y$Major = factor(y$Major)
# Cross-tabulate and add the margins.
table(y$Class, y$Major)
addmargins(table(y$Class, y$Major))
Data can be stored in many different formats. Most of the time, file names end with a dot followed by 3 or more letters. This is called an extension and indicates the format of the file.
R understands many different formats. In this class we'll see:
Format | Extension | R Function |
---|---|---|
R data | .rds | readRDS |
R data | .rda | load |
Comma-separated values (CSV) | .csv | read.csv |
Tab-separated values (TSV) | .tsv | read.delim |
Plain-text tables | read.table |
The extension is just a hint! Sometimes it may be incorrect or misleading.
For the rest of this lecture, we'll use the NOAA Significant Earthquakes
Database. There's a download link in the data/
directory of the lecture
notes.
Despite the .txt
extension, this is a TSV file, because:
- The download page says it's a "tab-delimited format".
- Opening the file in a text editor shows tabs between the columns.
x = read.delim("signif.txt")
This is a much larger data set than the STS 98 roster.
dim(x)
names(x)
The names()
function is unusual because it also allows us to change the
names. Let's change them to lowercase to save some typing.
names(x) = tolower(names(x))
First of all, do you notice anything confusing about this data set?
str(x)
A lot of values show up as NA. NA is a special value in R that represents missing data. In other words, NA means no measurement was recorded. This is common in real data sets.
For instance, in the earthquake data, details are scarce on earthquakes that happened thousands of years ago.
With the table()
function, you can find out how many values are missing by
setting the useNA
parameter to 'always'
:
table(x$intensity, useNA = 'always')
Many other functions have options for handling NAs as well. For instance, an
inspection function we didn't discuss yet is range()
. It just reports the
smallest and largest values in a vector.
range(x$year)
range(x$intensity)
range(x$intensity, na.rm = TRUE)
The reasoning is that if there are unknown values, the range is unknown. So
we have to explicitly tell range()
to ignore unknown values.
Q: What's the type of NA?
A: It depends. By default it's logical:
typeof(NA)
But NA can be converted to any type, to match other data in a vector. For instance, the NAs in the intensity column are integers:
head(x$intensity, 1)
typeof(head(x$intensity, 1))
If we only need to check the number of NAs, there's an alternative to
table()
. The is.na()
function returns TRUE or FALSE whether the elements
of the parameter are NAs.
is.na(NA)
is.na(5)
is.na(c(NA, 4, 18, NA))
is.na(x$intensity)
Since R represents TRUE as 1 and FALSE as 0, we can sum up the TRUES and FALSES to get the number of NAs:
sum(is.na(x$deaths))
Q: What if we want the number of values that aren't missing?
A: There are three ways to do this. Number of observations minus number of missing values:
nrow(x) - sum(is.na(x$intensity))
Number of values minus number of missing values:
length(x$intensity) - sum(is.na(x$intensity))
Number of values that are not (! means "not") missing:
sum(!is.na(x$intensity))
We can use functions we learned before to find out which countries had the most earthquakes:
tail(sort(table(x$country)))
summary(x$country)
What if we just want data on USA earthquakes? The subset()
function lets
us take a subset of rows. The first parameter is a data set, and the second
parameter is a condition.
x_us = subset(x, country == "USA")
Since =
already stands for assignment, R uses ==
to stand for equality.
The result is TRUE or FALSE vector.
5 == 5
5 == 6
c(6, -2, 8) == c(6, 3, 8)
Notice that we can also compare a vector against a single value:
c(6, -2, 8) == 8
8 == c(6, -2, 8)
This is called recycling, because the single value gets reused for each comparsion. We didn't discuss it before, but all vectorized R functions recycle:
8 + c(2, 4, 6)
Q: How can we combine conditions? What if we want quakes that happened in the USA or happened before year 0?
A: Use a logical operator like | ("or") or & ("and"). These are discussed later on in the notes.
Condition to get quakes before year 0:
year < 0
Get quakes that are in USA or before year 0:
x_us_bc = subset(x, country == "USA" | year < 0)
Q: Can we save conditions in a variable?
A: Yes. Let's say we wanted to get a count of how many quakes happened before year 0. Then:
from_bc = x$year < 0
# Before year 0.
sum(from_bc)
# After year 0.
sum(!from_bc)
Now we've seen ==
and all()
for working with comparisons, but there's
several other useful functions:
- Comparisons: ==, <, >, <=, >=
- Logic Operators
- or: |, ||
- and: &, &&
- not: !
- quantifiers ("how many"): any(), all()
Let's look at some examples.
For |
, left OR right must be TRUE to get TRUE:
TRUE | FALSE
c(TRUE, TRUE, FALSE, FALSE) | c(TRUE, FALSE, TRUE, FALSE)
# Get quakes from year -2150 OR in Jordan. Either or both conditions must be
# true.
subset(x, year == -2150 | country == "JORDAN")
For &
, left AND right must be TRUE to get TRUE:
TRUE & FALSE
c(TRUE, TRUE, FALSE, FALSE) & c(TRUE, FALSE, TRUE, FALSE)
# Get quakes from year -2150 AND in Jordan. Both conditions must be true.
subset(x, year == -2150 & country == "JORDAN")
Beware! The ||
and &&
operators are faster but only check the first
element:
c(TRUE, FALSE, TRUE) || c(TRUE, FALSE, FALSE)
The not operator returns the opposite:
!TRUE
!c(TRUE, FALSE)
Also note we can abbreviate TRUE and FALSE as T and F:
T | F
It's safer to use TRUE or FALSE, because T and F are just variables and can be assigned other values:
T = 6
Q: Can I reset T and F?
A: Yes, use rm()
to remove a variable assignment.
rm(T)
TRUE and FALSE are special values in R; the logical operators don't work on strings:
"TRUE" | "F"