Merge pull request #290 from ytakemon/iss_YT
Addressing issues in mega issue section: R basics continued - factors and data frames
naupaka authored Oct 3, 2024
2 parents 9e3d945 + 4752cd1 commit 3c1641f
Showing 1 changed file with 45 additions and 13 deletions.
58 changes: 45 additions & 13 deletions episodes/03-basics-factors-dataframes.Rmd
@@ -185,7 +185,15 @@ you have the `variants` object, listed as 801 obs. (observations/rows)
of 29 variables (columns). Double-clicking on the name of the object will open
a view of the data in a new tab.

![RStudio data frame view]("fig/rstudio_dataframeview.png")
![RStudio data frame view]("epidoes/fig/rstudio_dataframeview.png")

We can also quickly query the dimensions of the data frame using `dim()`. You'll see that the first number, `801`, is the number of rows, and the second, `29`, is the number of columns.

```{r, purl=FALSE}
## get the dimensions (rows, then columns) of a data frame
dim(variants)
```
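
As a minimal sketch of how `dim()` relates to `nrow()` and `ncol()`, assuming a small made-up data frame called `toy` (hypothetical values, not the lesson's `variants` data):

```r
# Toy data frame standing in for `variants` (hypothetical values)
toy <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  POS = c(101L, 202L, 303L, 404L),
  QUAL = c(4, 228, 210, 178)
)

# dim() reports the number of rows first, then the number of columns
toy_dim <- dim(toy)
toy_dim

# nrow() and ncol() return each number on its own
nrow(toy)
ncol(toy)
```

Here `nrow(toy)` is the same value as `dim(toy)[1]`, and `ncol(toy)` the same as `dim(toy)[2]`.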

## Summarizing, subsetting, and determining the structure of a data frame.

@@ -208,12 +216,17 @@ these columns, as well as mean, median, and interquartile ranges. Many of the
other variables (e.g. `sample_id`) are treated as character data (more on this
in a bit).

There is a lot to work with, so we will subset the first three columns into a
new data frame using the `data.frame()` function.
There is a lot to work with, so we will subset the columns into a new data frame using
the `data.frame()` function. To subset/index a two-dimensional object, we put the
row and column indices inside square brackets, separated by a comma. The left-hand
side of the comma indicates the rows you want to subset, and the right-hand side
indicates the columns (e.g. `[row index, column index]`).

```{r, purl=FALSE}
## put the first three columns of variants into a new data frame called subset
Let's put the columns 1, 2, 3, and 6 into a new data frame called subset:

```{r, purl=FALSE}
## Notice that we wrap the column numbers in c() to pass them as a single vector
## on the right-hand side of the comma. Leaving the left-hand side empty keeps all rows.
subset <- data.frame(variants[, c(1:3, 6)])
```
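
The `[row index, column index]` pattern can be sketched on a small made-up data frame. The object `toy` and its values are hypothetical and only meant to illustrate the bracket notation:

```r
# Toy data frame (hypothetical values) to practice [row, column] indexing
toy <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  CHROM = c("chr1", "chr1", "chr2"),
  POS = c(101L, 202L, 303L),
  QUAL = c(4, 228, 210)
)

# Row 2, all columns (nothing after the comma means "every column")
second_row <- toy[2, ]

# All rows, columns 1 and 3 (nothing before the comma means "every row")
two_cols <- toy[, c(1, 3)]

# Rows 1 to 2, columns 1, 2, and 4
block <- toy[1:2, c(1:2, 4)]
dim(block)
```

Leaving one side of the comma empty selects everything along that dimension, which is why `variants[, c(1:3, 6)]` keeps every row.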

@@ -228,12 +241,13 @@ str(subset)

Ok, that's a lot to unpack! Some things to notice:

- the object type `data.frame` is displayed in the first row along with its
- The object type `data.frame` is displayed in the first row along with its
dimensions, in this case 801 observations (rows) and 4 variables (columns)
- Each variable (column) has a name (e.g. `sample_id`). This is followed
by the object mode (e.g. chr, int, etc.). Notice that before each
- Each variable (column) has a name (e.g. `sample_id`). Notice that before each
variable name there is a `$` - this will be important later.

- Each variable name is followed by the data type it contains (e.g. chr, int, etc.).
The `int` type indicates an integer, a type of numeric data that can only
store whole numbers (i.e. no decimal points).
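
To see how these type labels arise, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical). It assumes nothing beyond base R:

```r
# Toy data frame (hypothetical values) with one column of each common type
toy <- data.frame(
  sample_id = c("s1", "s2"),  # character
  POS = c(101L, 202L),        # integer: the L suffix marks whole numbers
  QUAL = c(4.5, 228),         # numeric (double): decimals allowed
  stringsAsFactors = FALSE    # keep text as character, whatever the R version
)

# str() lists each column after a `$`, followed by its type
str(toy)

# The same `$` extracts a single column by name
pos_class <- class(toy$POS)
qual_class <- class(toy$QUAL)
id_class <- class(toy$sample_id)
```

In the `str()` output these columns should be labelled `chr`, `int`, and `num` respectively, matching the classes stored above.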


::::::::::::::::::::::::::::::::::::::: challenge
@@ -297,10 +311,19 @@ head(alt_alleles)
```

There are 801 alleles (one for each row). To simplify, let's look at just the
single-nucleotide alleles (SNPs). We can use some of the vector indexing skills
from the last episode.
single-nucleotide alleles (SNPs).

Let's review some of the vector indexing skills from the last episode that can help:

```{r, purl=FALSE}
# This compares every allele to the single nucleotide "A" and returns a TRUE/FALSE vector
alt_alleles == "A"
# Then, we use that logical vector as an index to pull out all the elements that match.
alt_alleles[alt_alleles == "A"]
# If we repeat this for each nucleotide A, T, G, and C, and connect them using `c()`,
# we can index all the single nucleotide changes.
snps <- c(alt_alleles[alt_alleles == "A"],
alt_alleles[alt_alleles=="T"],
alt_alleles[alt_alleles=="G"],
@@ -318,7 +341,13 @@ plot(snps)
```

Whoops! Though the `plot()` function will do its best to give us a quick plot,
it is unable to do so here. One way to fix this is to tell R to treat the SNPs
it is unable to do so here. Let's use `str()` to see why this might be:

```{r, purl=FALSE}
str(snps)
```

R may not know how to plot a character vector! One way to fix this is to tell R to treat the SNPs
as categories (i.e. a factor vector); we will create a new object to avoid
confusion using the `factor()` function:

@@ -349,9 +378,12 @@ We can see how many items in our vector fall into each category:

```{r, purl=FALSE}
summary(factor_snps)
# Compare the character vector
summary(snps)
```

As you can imagine, this is already useful when you want to generate a tally.
As you can imagine, factors are already useful when you want to generate a tally.
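
The difference between the two `summary()` calls can be sketched on a short made-up vector. The names `toy_snps` and `toy_factor` and their values are hypothetical, not taken from the lesson data:

```r
# Toy character vector of single-nucleotide alleles (hypothetical values)
toy_snps <- c("A", "T", "G", "A", "C", "A", "T")

# Convert to a factor: each distinct value becomes a level
toy_factor <- factor(toy_snps)
toy_levels <- levels(toy_factor)

# summary() on a factor counts the items in each level
factor_tally <- summary(toy_factor)
factor_tally

# summary() on the character vector only reports length, class, and mode
summary(toy_snps)

# table() gives a tally for either kind of vector
char_tally <- table(toy_snps)
```

By default `factor()` sorts the levels alphabetically, so the tally is reported in the order A, C, G, T rather than in order of first appearance.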

::::::::::::::::::::::::::::::::::::::::: callout
