Merge pull request #47 from fhdsl/looks-nice
Looks nice
caalo authored Jun 6, 2024
2 parents 44d83af + dc6649b commit cc8d3d7
Showing 7 changed files with 55 additions and 36 deletions.
58 changes: 36 additions & 22 deletions 01-intro-to-computing.Rmd
@@ -1,12 +1,16 @@
# Intro to Computing

Welcome to Introduction to R! Each week, we cover a chapter, which consists of a lesson and exercise. In our first week together, we will look at big conceptual themes in programming, see how code is run, and learn some basic grammar structures of programming.

## Goals of the course

In the next 6 weeks, we will explore:

- Fundamental concepts in high-level programming languages (R, Python, Julia, WDL, etc.) that are transferable: *How do programs run, and how do we solve problems using functions and data structures?*

- Beginning of data science fundamentals: *How do you translate your scientific question to a data wrangling problem and answer it?*

![Data science workflow](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="450"}
![Data science workflow. Image source: [R for Data Science](https://r4ds.hadley.nz/whole-game).](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="450"}

- Find a nice balance between the two throughout the course: we will try to reproduce a figure from a scientific publication using new data.

@@ -26,34 +30,49 @@ More importantly: **How we organize ideas \<-\> Instructing a computer to do som

- Combining expressions to create more complex expressions

- Encapsulate complex expressions via functions to create modular and reusable tasks

- Encapsulate complex data via data structures to allow efficient manipulation of data
- Encapsulate complex expressions via **functions** to create modular and reusable tasks

- Encapsulate complex data via **data structures** to allow efficient manipulation of data
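As a rough illustration of the last two points, the sketch below wraps a calculation in a function and stores several values in a vector. The function name `fahrenheit_to_celsius` and the temperature values are made up for this example and are not part of the course materials.

```r
# Hypothetical function: encapsulates the conversion formula so it can be reused.
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# A vector is a simple data structure that holds several values at once.
temps_f <- c(32, 212)

# The same function is applied to every element of the vector.
temps_c <- fahrenheit_to_celsius(temps_f)
temps_c
```

The person calling `fahrenheit_to_celsius()` does not need to remember the formula, which is the point of encapsulation.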

## Posit Cloud Setup

Posit Cloud/RStudio is an Integrated Development Environment (IDE). Think about it as Microsoft Word to a plain text editor. It provides extra bells and whistles to using R that is easier for the user.
Posit Cloud (the website version of RStudio) is an Integrated Development Environment (IDE). Think of it as what Microsoft Word is to a plain text editor: it adds extra bells and whistles that make R easier to use.

Let's open up the KRAS analysis in Posit Cloud. If you are taking this course while it is in session, the project is probably named "KRAS Demo" in your Posit Cloud workspace. If you are taking this course on your own time, open up the ["Intro to R Exercises and Solutions" project](https://posit.cloud/content/8245357).

Once you have opened the project, open the file "KRAS_demo.qmd" from the File Browser, and you should see something like this:

![](images/posit.jpg)

Today, we will pay close attention to:

- Script editor: where sequence of instructions are typed and saved as a text document as a R program. To run the program, the console will execute every single line of code in the document.
- R Console (Interpreter): You give it one line of R code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you.

- Script Editor: where many lines of R code are typed and saved as a text document. To run the script, the Console will execute every single line of code in the document. The document you have opened in the Script Editor is a Quarto Document. A Quarto Document has chunks of plain text *and* R code, which helps us better understand the code we are writing.

- Environment: Often, your code will store information in the Environment, so that information can be reused. For instance, we often load in data and store it in the Environment, and use it throughout the rest of our R code.

- Console (interpreter): Instead of giving a entire program in a text file, you could interact with the R Console line by line. You give it one line of instruction, and the console executes that single line. It is what R looks like without RStudio.
The first thing we will do is see the different ways we can run R code. You can do the following:

- Environment: Often, code will store information *in memory*, and it is shown in the environment. More on this later.
1. Type something into the R Console, such as `2+2`, and press Enter. The R Console will run it and give you an output.
2. Scroll down the Quarto Document, and when you see a chunk of R Code, click the green arrow button. It will copy the R code chunk to the R Console and run all of it. You will likely see variables created in the Environment as you load in and manipulate data.
3. Run every single R code chunk in the Quarto Document by pressing the Run button at the top left corner of the Script Editor. It will generate an output document with all the code run.

## Using Quarto for your work
Remember that the *order* that you run your code matters in programming. Your final product would be the result of Option 3, in which you run every R code chunk from start to finish. However, sometimes it is nice to try out smaller parts of your code via Options 1 or 2. But you will be at risk of running your code out of order!
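To see why order matters, here is a minimal sketch. The variable names are made up for illustration and do not come from the KRAS demo. Running the first three lines from top to bottom gives one result; re-running a single line on its own afterwards changes what is stored in the Environment.

```r
# Hypothetical example: the variable names are illustrative only.
total <- 2 + 2        # total is 4
total <- total * 10   # total is now 40
in_order <- total     # value after a clean top-to-bottom run

# Re-running the multiplication line by itself (as you might with Option 1 or 2)
# updates `total` again, so it no longer matches the top-to-bottom result.
total <- total * 10
out_of_order <- total

in_order
out_of_order
```

If the two values differ, the Environment no longer reflects what a full run of the document would produce, which is the risk described above.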

Why should we use Quarto for data science work?
Quarto is great for data science work, because:

- Encourages reproducible workflows
- It encourages reproducible data analysis, when you run your analysis from start to finish.

- Code, output from code, and prose combined together
- It encourages excellent documentation, as you can have code, output from code, and prose combined together.

- Extensions to Python, Julia, and more.
- It is flexible to other programming languages, such as Python.

More options and guides can be found in [Introduction to Quarto](https://quarto.org/docs/get-started/hello/rstudio.html) .
More options and guides can be found in [Introduction to Quarto](https://quarto.org/docs/get-started/hello/rstudio.html).

###

Now, we will get to the basics of programming grammar.

## Grammar Structure 1: Evaluation of Expressions

@@ -84,7 +103,7 @@ sum(18, sum(21, 65))

Remember the function machine from algebra class? We will use this schema to think about expressions.

![Function machine from algebra class.](https://cs.wellesley.edu/~cs110/lectures/L16/images/function.png)
![Function machine from algebra class. ](https://cs.wellesley.edu/~cs110/lectures/L16/images/function.png)

If an expression is made out of multiple, nested operations, how should the R Console interpret it? Being able to read nested operations and nested functions is a very important skill for a programmer.

@@ -95,7 +114,6 @@ If an expression is made out of multiple, nested operations, what is the proper

Lastly, a note on the use of functions: a programmer should not need to know how the function is implemented in order to use it - this emphasizes [abstraction and modular thinking](#a-programming-language-has-following-elements), a foundation in any programming language.


### Data types

Here are some data types that we will be using in this course:
@@ -106,7 +124,6 @@ Here are some data types that we will be using in this course:

- **Logical**: TRUE, FALSE
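One way to check the data type of a value is the base R function `class()`. The sketch below is only an illustration; the values and variable names are arbitrary.

```r
# Ask R what data type each value has.
type_of_number  <- class(3.5)
type_of_text    <- class("hello")
type_of_logical <- class(TRUE)

type_of_number
type_of_text
type_of_logical
```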


## Grammar Structure 2: Storing data types in the environment

To build up a computer program, we need to store our returned data type from our expression somewhere for downstream use. We can assign a variable to it as follows:
@@ -153,7 +170,6 @@ sqrt(nchar("hello"))
(nchar("hello") + 4) * 2
```
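The two nested expressions above can also be read step by step by storing each intermediate result in a variable. This is just a sketch; the variable names are made up for illustration.

```r
# Step 1: the innermost expression is evaluated first.
n_letters <- nchar("hello")

# Step 2: the result is passed to the next function or operation.
root_of_length <- sqrt(n_letters)
scaled_length  <- (n_letters + 4) * 2

n_letters
root_of_length
scaled_length
```

Breaking a nested expression apart like this is a useful way to check your understanding of the order of evaluation.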


## Tips on writing your first code

`Computer = powerful + stupid`
@@ -162,14 +178,12 @@ Even the smallest spelling and formatting changes will cause unexpected output a

- Write incrementally, test often

- Check your assumptions, especially using new functions, operations, and new data types.
- Check your assumptions, especially using new functions, operations, and new data types.

- Live environments are great for testing, but not great for reproducibility.
- Live environments are great for testing, but not great for reproducibility.

- Ask for help!


## Exercises

You can find [exercises and solutions on Posit Cloud](https://posit.cloud/content/8245357), or on [GitHub](https://github.com/fhdsl/Intro_to_R_Exercises).

2 changes: 2 additions & 0 deletions 02-data-structures.Rmd
@@ -1,5 +1,7 @@
# Working with data structures

In our second lesson, we start to look at two **data structures**, **vectors** and **dataframes**, that can handle a large amount of data.

## Vectors

In the first exercise, you started to explore **data structures**, which store information about data types. You played around with **vectors**, which are ordered collections of a single data type. Each *element* of a vector contains a data type, and there is no limit on how big a vector can be, as long as its memory use fits within the computer's memory (RAM).
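As a quick reminder, a vector can be created with `c()` and inspected with base R functions. The values below are made up for illustration.

```r
# Hypothetical values: an ordered collection of numbers.
ages <- c(60, 36, 72)

n_elements     <- length(ages)   # how many elements the vector holds
second_element <- ages[2]        # elements are accessed by position, starting at 1
vector_type    <- class(ages)    # every element shares the same data type

n_elements
second_element
vector_type
```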
12 changes: 6 additions & 6 deletions 03-data-wrangling1.Rmd
@@ -7,11 +7,11 @@ library(tidyverse)
load(url("https://github.com/fhdsl/Intro_to_R/raw/main/classroom_data/CCLE.RData"))
```

## Data Science Workflow
From our first two lessons, we are now equipped with enough fundamental programming skills to apply them to various steps in the data science workflow, which is a natural cycle that occurs in data analysis.

![](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="550"}
![Data science workflow. Image source: [R for Data Science.](https://r4ds.hadley.nz/whole-game)](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="550"}

We are now equipped with enough fundamental programming skills to apply it to various steps in the data science workflow. We start with *Transform* and *Visualize* with the assumption that our data is in a nice, "Tidy format". First, we need to understand what it means for a data to be "Tidy".
For the rest of the course, we focus on *Transform* and *Visualize* with the assumption that our data is in a nice, "Tidy format". First, we need to understand what it means for data to be "Tidy".

## Tidy Data

@@ -27,7 +27,7 @@ If you want to be technical about what variables and observations are, Hadley Wi

> A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
![A tidy dataframe.](https://r4ds.hadley.nz/images/tidy-1.png){width="800"}
![A tidy dataframe. Image source: [R for Data Science](https://r4ds.hadley.nz/data-tidy).](https://r4ds.hadley.nz/images/tidy-1.png){width="800"}

## Examples and counter-examples of Tidy Data:

@@ -75,8 +75,8 @@ Let's see how these datasets fit the definition of Tidy data:
| Dataframe | The observation is | Some variables are | Some values are |
|------------------|------------------|-------------------|------------------|
| metadata | Cell line | ModelID, Age, OncotreeLineage | "ACH-000001", 60, "Myeloid" |
| expression | | | |
| mutation | | | |
| expression | Cell line | KRAS_Exp | 2.4, .3 |
| mutation | Cell line | KRAS_Mut | TRUE, FALSE |
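To make the table concrete, here is a toy dataframe shaped like `expression`, with one row per cell line and one column per variable. This is only a sketch: it assumes the dataframe has a `ModelID` column, the second ID is made up, and the real CCLE data has far more rows.

```r
# Toy stand-in for the `expression` dataframe (assumed layout, hypothetical second ID).
toy_expression <- data.frame(
  ModelID  = c("ACH-000001", "ACH-000002"),
  KRAS_Exp = c(2.4, 0.3)
)

# Each row is one observation (a cell line); each column is one variable.
n_observations <- nrow(toy_expression)
variable_names <- colnames(toy_expression)

toy_expression
```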

## Transform: "What do you want to do with this dataframe"?

2 changes: 0 additions & 2 deletions 04-data-wrangling2.Rmd
@@ -110,8 +110,6 @@ Given `xxx_join(x, y, by = "common_col")`,

## Grouping and summarizing dataframes

Also known as: "The rows I want is described by a column. The columns I want need to be summarized from other columns."

In a dataset, there may be multiple levels of observations, and which level of observation we examine depends on our scientific question. For instance, in `metadata`, the observation is cell lines. However, perhaps we want to understand properties of `metadata` in which the observation is the cancer type, `OncotreeLineage`. Suppose we want the mean age of each cancer type, and the number of cell lines that we have for each cancer type.

This is a scenario in which the *desired rows are described by a column*, `OncotreeLineage`, and the columns, such as mean age, need to be *summarized from other columns.*
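A minimal sketch of this idea uses `group_by()` and `summarise()` from `dplyr` on a toy stand-in for `metadata`. The rows below are invented for illustration, and the output column names `MeanAge` and `Count` are arbitrary choices.

```r
library(dplyr)

# Toy stand-in for `metadata`: the values are hypothetical.
toy_metadata <- data.frame(
  OncotreeLineage = c("Lung", "Lung", "Myeloid"),
  Age             = c(60, 40, 30)
)

# Rows of the result are described by OncotreeLineage;
# columns of the result are summarized from other columns.
lineage_summary <- toy_metadata %>%
  group_by(OncotreeLineage) %>%
  summarise(MeanAge = mean(Age), Count = n())

lineage_summary
```

The result has one row per cancer type instead of one row per cell line, so the level of observation has changed.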
15 changes: 10 additions & 5 deletions 05-data-visualization.Rmd
@@ -9,25 +9,26 @@ library(tidyverse)
library(palmerpenguins)
```

In our last lesson, we go over common plotting styles and how to create them via a popular R package, Grammar of Graphics (`ggplot`).
This week, we learn how to visualize our data. There are several different data visualization tools in R, and we focus on one of the most popular, the "Grammar of Graphics", also known as "ggplot". The syntax for "ggplot" will look a bit different from the code we have been writing, for example `ggplot(penguins) + aes(x = bill_length_mm) + geom_histogram()`. The outputs of functions such as `ggplot()` and `aes()` are not data types or data structures that we are familiar with; rather, they are graphical information. Worry less about how this syntax relates to what we have learned in the course so far, and instead view it as a new grammar (of graphics!) whose pieces you can "layer" to create more sophisticated plots.

## Common Plots

### Univariate
To get started, we will consider these most simple and common plots:

**Univariate**

- Numeric: histogram

- Character: bar plots

### Bivariate
**Bivariate**

- Numeric vs. Numeric: Scatterplot, line plot

- Numeric vs. Character: Box plot

Why do we focus on these common plots? Our eyes are better at distinguishing some visual features than others. All of these plots use position to depict data, which gives us the most effective visual scale.

![Source: <https://www.oreilly.com/library/view/visualization-analysis-and/9781466508910/K14708_C005.xhtml>](https://www.oreilly.com/api/v2/epubs/9781466508910/files/image/fig5-1.png)
![Image Source: <https://www.oreilly.com/library/view/visualization-analysis-and/9781466508910/K14708_C005.xhtml>](https://www.oreilly.com/api/v2/epubs/9781466508910/files/image/fig5-1.png)

## Grammar of Graphics

@@ -41,6 +42,8 @@ The syntax of the grammar of graphics breaks down into 4 sections.

[Additional settings]{style="color:purple"}

You add these 4 sections together to form a plot.

### Histogram

[ggplot(penguins)]{style="color:orange"} + [aes(x = bill_length_mm)]{style="color:green"} + [geom_histogram()]{style="color:blue"} + [theme_bw()]{style="color:purple"}
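The same four-part pattern can be built up one layer at a time. The sketch below swaps in the built-in `mtcars` dataset (an assumption made so the example runs without `palmerpenguins`), and the intermediate variable names are made up for illustration.

```r
library(ggplot2)

# Section 1: the data
plot_data <- ggplot(mtcars)

# Section 2: the mapping of a column to a visual property
plot_mapped <- plot_data + aes(x = mpg)

# Section 3: the geometry, then Section 4: additional settings
plot_hist <- plot_mapped + geom_histogram(bins = 10) + theme_bw()

plot_hist
```

Each `+` adds one section to the plot, so you can stop at any step and inspect what you have so far.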
@@ -160,6 +163,8 @@ ggplot(data = penguins) + geom_point(mapping = aes(x = bill_length_mm, y = bill_

Consider the `esquisse` package to help generate your ggplot code via drag and drop.

An excellent ggplot "cookbook" can be found [here](https://r-graphics.org/).

## Exercises

You can find [exercises and solutions on Posit Cloud](https://posit.cloud/content/8245357), or on [GitHub](https://github.com/fhdsl/Intro_to_R_Exercises).
Binary file added images/posit.jpg
2 changes: 1 addition & 1 deletion index.Rmd
@@ -25,4 +25,4 @@ The course is intended for researchers who want to learn coding for the first ti

## Offerings

This course is taught on a regular basis at [Fred Hutch Cancer Center](https://www.fredhutch.org/) through the [Data Science Lab](https://hutchdatascience.org/). Announcements of course offering can be found [here](https://hutchdatascience.org/training/). If you wish to follow the course content asynchronously, you may access the course content on this website and [exercises and solutions on Posit Cloud](https://posit.cloud/content/8245357). The Posit Cloud compute space can be copied to your own workspace for personal use, or you can access the [exercises and solutions on GitHub](https://github.com/fhdsl/Intro_to_R_Exercises).
This course is taught on a regular basis at [Fred Hutch Cancer Center](https://www.fredhutch.org/) through the [Data Science Lab](https://hutchdatascience.org/). Announcements of course offerings can be found [here](https://hutchdatascience.org/training/). If you wish to follow the course content asynchronously, you may access the course content on this website and [exercises and solutions on Posit Cloud](https://posit.cloud/content/8245357). The Posit Cloud compute space can be copied to your own workspace for personal use, and you can get started via this [introduction](https://hutchdatascience.org/Intro_to_R/intro-to-computing.html#posit-cloud-setup). Or, you can access the [exercises and solutions on GitHub](https://github.com/fhdsl/Intro_to_R_Exercises).
