Data resources

A list of helpful links and code for R, SQL, Python, HTML/CSS, etc

If you know of a good resource that you want to share, feel free to submit a pull request! If you're not sure how to do that, you can also submit an issue with the link to the resource and a quick intro to why it's helpful for you.

This doc is based on my Google Doc of resources and was inspired by a conversation I had on Twitter.

I'm slowly working on converting the original doc!

Contents

News

GIS and online mapping

Data visualization (mostly point-and-click)

General data viz books, etc

Color specific resources

Literature

Interactive Color Tools

Design Inspiration

Inclusive Design

There is no use having a visualization if folks can't read it.

Rstats

Links

Mentorship

  • Data helpers - A site by Angela Bassa with a list of people willing to help/mentor!

Podcasts

Interviews

Packages to know/learn

Blogs

Books

Style guides

Helpful file functions

file.path() # takes folder names and returns a path to your file

file.choose() # pulls up finder so you can point to your file

download.file(url, dest_path) # makes downloading files reproducible
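
A tiny sketch putting those together (the URL and folder here are just placeholders):

dest <- file.path("data", "my_file.csv")                # builds "data/my_file.csv" on any OS
download.file("https://example.com/my_file.csv", dest)  # hypothetical URL, just for illustration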

JSON in R

jsonlite is a good package to use!

  • fromJSON() takes in a JSON file (can be a URL) and returns an R list
  • toJSON() takes an R object (like a list or data frame) and converts it to JSON

Formatting JSON files:

  • minify()
    • {"a":1,"b":2,"c":{"x":5,"y":6}}
  • prettify(), or set pretty = TRUE in toJSON()
    • { "a": 1, "b": 2, "c": { "x": 5, "y": 6 } }
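
A small round-trip sketch with jsonlite (the list here is made up just for illustration):

library(jsonlite)

x <- list(a = 1, b = 2, c = list(x = 5, y = 6))
json <- toJSON(x, auto_unbox = TRUE) # {"a":1,"b":2,"c":{"x":5,"y":6}}
fromJSON(json)                       # back to an R list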

Extracting data from PDFs

ROpenSci has an awesome package for using Tabula in R: tabulizer

Tabula is also awesome on its own.

# example code from github

library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
out1 <- extract_tables(f)

out2 <- extract_tables(f, pages = 1, guess = FALSE, method = "data.frame")

extract_areas() # turns the pdf into an R graphic so you can select the tables on a page, if the whole page isn't a table!

out3 <- extract_text(f, page = 3) # gets just the text from the table, without turning it into a dataframe
cat(out3, sep = "\n")
## len supp dose
## 4.2 VC 0.5
## 11.5 VC 0.5
## 7.3 VC 0.5
## 5.8 VC 0.5
## 6.4 VC 0.5
## 10.0 VC 0.5
## 11.2 VC 0.5
## 11.2 VC 0.5
## 5.2 VC 0.5
## 7.0 VC 0.5
## 16.5 VC 1.0
## 16.5 VC 1.0
## 15.2 VC 1.0
## 17.3 VC 1.0
## 22.5 VC 1.0
## 3

How to install tabulizer:

(On Windows)

From the GitHub vignette:

In the command prompt, install Chocolatey if you don't already have it:

@powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin

Then install java!

choco install jdk7 -y

To set an environment variable in R, use:

Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_92")

If that doesn't work, look at the other options in the vignette.

Then install the dev version of the package (not yet on CRAN):

if (!require("ghit")) {
    install.packages("ghit")
}
# on 64-bit Windows
ghit::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
ghit::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

Excel in R

You can use the xlsx package to create an Excel file.

The readxl package is the tidyverse way to work with Excel files.

excel_sheets() # gives a character vector of the sheet names

read_excel() # reads in excel files
# args: sheet (defines which sheet you want)
#       col_names = TRUE
#       col_types ("blank" to skip a column)
#       skip (n rows to skip)

# put all sheets in a list to keep them together

my_workbook <- lapply(excel_sheets("data.xlsx"),
  read_excel,
  path = "data.xlsx")

XLConnect - a dynamic connection between R and Excel.

createSheet(book, "name")
loadWorkbook("path")
getSheets(book)
writeWorksheet() # function (object, data, sheet, startRow = 1, startCol = 1, header = TRUE, rownames = NULL)
saveWorkbook(object, "path")
renameSheet() # function (object, sheet, newName)
removeSheet() # function (object, sheet)
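
Putting a few of those together (a rough sketch; the file and sheet names are placeholders):

library(XLConnect)

book <- loadWorkbook("report.xlsx", create = TRUE) # create = TRUE makes the file if it doesn't exist
createSheet(book, "cars")
writeWorksheet(book, mtcars, sheet = "cars")
saveWorkbook(book) # writes everything back to report.xlsx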

Googlesheets package <3

googlesheets

There's a vignette with lots of good info.

# uses oauth to access drive

gs_auth() # run to authorize

From the github site, an overview of the functions:

gs_ls() # list sheets
gs_title() # register a sheet by title
gs_key() # register a sheet by key
gs_url() # register a sheet by url

... tbd
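
A minimal read sketch, assuming a sheet titled "my_sheet" already exists in your Drive:

library(googlesheets)

gs_auth()                       # authorize once per session
sheet <- gs_title("my_sheet")   # register the sheet by its title
data  <- gs_read(sheet, ws = 1) # read the first worksheet into a data frame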

Dates and times

Use strptime() to convert character strings to date-times (POSIXlt objects).

strptime(x, format, tz = "")

Formatting POSIXct objects:

  • %Y: 4-digit year (1982)
  • %y: 2-digit year (82)
  • %m: 2-digit month (01)
  • %d: 2-digit day of the month (13)
  • %A: weekday (Wednesday)
  • %a: abbreviated weekday (Wed)
  • %B: month (January)
  • %b: abbreviated month (Jan)
  • %H: hours as a decimal number (00-23)
  • %I: hours as a decimal number (01-12)
  • %M: minutes as a decimal number
  • %S: seconds as a decimal number
  • %T: shorthand notation for the typical format %H:%M:%S
  • %p: AM/PM indicator

Example use:

# from r documentation
x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
z <- strptime(x, "%d%b%Y")
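
Going the other direction, format() uses the same codes to print a date-time as text (a small sketch; weekday and month names depend on your locale):

format(z[1], "%A, %B %d, %Y")
#> [1] "Friday, January 01, 1960"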

Also use lubridate for working with date objects!

There's helpful info on the tidyverse site as well as a nice vignette

Lubridate makes it easy to manipulate and work with dates.

It has some of the same capabilities as strptime(), although I actually prefer lubridate.

ymd(20101215)
#> [1] "2010-12-15"
mdy("4/1/17")
#> [1] "2017-04-01"

Lubridate also makes timezones less confusing!

time <- ymd_hms("2010-12-13 15:30:30")
time
#> [1] "2010-12-13 15:30:30 UTC"

# Changes printing
with_tz(time, "America/Chicago")
#> [1] "2010-12-13 09:30:30 CST"

# Changes time
force_tz(time, "America/Chicago")
#> [1] "2010-12-13 15:30:30 CST"

You can also pull out specific parts of a date object.

bday <- dmy("14/10/1979")
month(bday)
#> [1] 10
wday(bday, label = TRUE)
#> [1] Sun
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

Say you want to turn a time into a numerical representation (e.g. 8:30 AM becomes 8.5). Use lubridate!

time <- hm("8:30")

hour(time) + minute(time)/60
#> 8.5

Working with nested lists

str() is a helpful starter function for working with nested lists, but we can certainly improve upon it. First, the max.level argument specifies how much information you want about the list. For example, the output from str(..., max.level = 0) is something along the lines of List of x, where x is the length of your list.
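
For instance (my_list here is just a made-up nested list):

my_list <- list(a = 1:3, b = list(x = "hi", y = TRUE))

str(my_list, max.level = 0) # List of 2
str(my_list, max.level = 1) # shows a and b, but nothing inside b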

A more interactive way to work with nested lists is with {listviewer}, a package that lets you easily explore your list.

library(listviewer)
jsonedit(df, mode = "view")

The options for the mode argument are code, form, text, tree, and view. Even though you specify the mode, you can easily switch within the interactive list view if you think something else would work better for your particular use.

Dataframes/tables

Conditionals

In R, you can write an if/else statement, similar to the way you would write it in Python. It helps me to think about it this way because I started programming in Python first.

It would look something like this:

if (condition) {
  do something
 } else if (condition) {
  do something
 } else {
  do something
 }
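
Filled in with a toy condition, that looks like:

x <- 7

if (x > 5) {
  print("high")
} else if (x == 5) {
  print("medium")
} else {
  print("low")
}
#> [1] "high"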

But let's say you want to define a variable in a dataframe based on a conditional. It is way easier with tidyverse packages!

You can use mutate() with ifelse(condition, if-do, else-do)

For example, let's say you want to assign high & low with a conditional based on some values. You could do:

library(dplyr)

df %>%
  mutate(x = ifelse(var > 5, "high", "low"))
  
# this assigns "high" to anything strictly higher than 5

You can also use a nested ifelse() for more complicated conditionals, although that is discouraged. More on that in a bit...

In this example, I wanted the final variable to say "White subjects", "Nonwhite subjects", or "Unknown". I started with two different variables. One said whether or not race was known, and the other gave race, if known.

library(tidyverse)

biopics <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/biopics/biopics.csv")

biopics <- biopics %>% # start with our original data
  mutate(race_plotting = ifelse(race_known == "Unknown", "Unknown", ifelse(subject_race == "White", "White subjects", "Nonwhite subjects"))) # this says, if race_known is unknown, stop there and just mark unknown. But if we know the race, check to see if the race is coded as white. If it is, mark white and stop there. If not, mark nonwhite.

Instead of getting lost in nested statements, try derivedFactor() from the mosaic package (found on GitHub):

library(mosaic) # derivedFactor() comes from the mosaic package

df <- mutate(df, newvar = derivedFactor(
  "group1" = oldvar %in% c("a", "b", "c"),
  "group2" = oldvar %in% c("d", "e", "f"),
  "group3" = oldvar %in% c("g", "h", "i"),
  .default = NA
))

case_when() is another good alternative that draws on some SQL syntax. Here's an example from Stack Overflow

library(dplyr) # >= 0.7.0
mtcars %>%
  mutate(cg = case_when(carb <= 2 ~ "low",
                        carb > 2  ~ "high"))
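
A handy pattern (a small sketch, not from the original answer) is to end with a catch-all condition, since case_when() returns NA for rows that match nothing:

mtcars %>%
  mutate(cg = case_when(carb <= 2 ~ "low",
                        carb <= 4 ~ "medium",
                        TRUE      ~ "high")) # everything else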

ggplot2

ggplot2 gallery

How to change the order of the bars in geom_bar() !!!!

# from https://stackoverflow.com/questions/5208679/order-bars-in-ggplot2-bar-graph

require(forcats)

ggplot(data, aes(x = fct_infreq(var))) + geom_bar()

# from https://rstudio-pubs-static.s3.amazonaws.com/7433_4537ea5073dc4162950abb715f513469.html

x$name <- factor(x$name, levels = x$name[order(x$val)])
ggplot(x, aes(x = name, y = val)) + theme_bw() + geom_bar(stat = "identity")

To look into: gghighlight in this blog post! Looks like an easier way to highlight certain lines/bars instead of manually coding that.

Cookbook of good ggplot2 hacks here!!

GIS in R

The opencage package (from rOpenSci) geocodes with the OpenCage API:

# forward geocoding

output <- opencage_forward(placename = "Sarzeau")

# reverse geocoding

output2 <- opencage_reverse(latitude = 51.5034070,
                            longitude = -0.1275920)

# adding parameters

results3 <- opencage_forward(placename = "Berlin", country = "DE")

But for now, I'm using a function I found online. Example:

I wanted to geocode the locations of the Boston community centers recently. I found the names here. Using the SelectorGadget Chrome extension to find CSS selectors, I scraped the names into R.

library(rvest) # for read_html(), html_nodes(), and html_text()

community_centers_url <- read_html("https://www.boston.gov/community-centers") # site with names

centers <- community_centers_url %>%
  html_nodes(".cd-t") %>% # css selector
  html_text() # I want the text

Then I used a combination of functions to geocode.

library(RCurl)   # getURL()
library(RJSONIO) # fromJSON(..., simplify = FALSE) matches this package's signature

url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
} # builds the Google Maps geocoding API request URL

geoCode <- function(address, verbose = FALSE) {
  if (verbose) cat(address, "\n")
  u <- url(address)
  doc <- getURL(u)
  x <- fromJSON(doc, simplify = FALSE)
  if (x$status == "OK") {
    lat <- x$results[[1]]$geometry$location$lat
    lng <- x$results[[1]]$geometry$location$lng
    location_type <- x$results[[1]]$geometry$location_type
    formatted_address <- x$results[[1]]$formatted_address
    return(c(lat, lng, location_type, formatted_address))
  } else {
    return(c(NA, NA, NA, NA))
  }
} # actually getting the coordinates and formatted addresses

I'm sure there's a cleaner way to do this; I'll work on using opencage and then update this page!

Another method of geocoding is to use the ggmap package and the Google API. First, register for a key here. After you get the key, this script will geocode for you (up to the API limit, which is around 2,500 requests/day).

library(ggmap)
register_google(key = "my_key")

for (i in 1:nrow(my_data_frame)) {
  # location is a string like "123 University Ave, Gainesville FL"
  # This works best if you format your locations the same way you would for Google Maps/Waze/Tom Tom
  result <- geocode(my_data_frame$location[i], output = "latlona", source = "google")
  my_data_frame$lon[i] <- as.numeric(result[1])
  my_data_frame$lat[i] <- as.numeric(result[2])
  my_data_frame$geoAddress[i] <- as.character(result[3])
}

Census data! I love census data

Tidycensus for demographic data.

library(tidycensus)
# census_api_key("your_key") # run once to store your Census API key

# median income in Boston census tracts

bos_inc <- get_acs(geography = "tract",
              variables = c(medincome = "B19013_001"),
              state = "MA",
              county = "Suffolk")

# see available variables for 2016 acs 5-year

load_variables(year = 2016, dataset = "acs5")

Tigris to load the shapefiles to map census data!

library(tigris)

options(tigris_use_cache = TRUE) # doesn't cache by default
bos_tract <- tracts(state = "MA", county = "Suffolk")

You can join the two with tigris::geo_join. Regular joins don't work with spatial data.

bos_joined <- geo_join(bos_tract, bos_inc, by = "GEOID")

And if I want to exclude a particular tract (it's mostly water, NA for all demographics) that makes my maps look ugly :(

bos_joined <- subset(bos_joined, NAME.1 != "Census Tract 9901.01, Suffolk County, Massachusetts")

dplyr::filter doesn't work with these sp spatial objects, which is why subset() is used above!!

Cleaning data

Principles of tidy data:

  • Each row is an observation, each column is an attribute/variable
  • Exactly one type of observational unit per table
  • Column headers should be variable names, not values (e.g. eye color vs. blue)

The tidyverse has understandable and compatible functions for dealing with data cleaning!

Sometimes you have a table that is "wide" or "narrow," and you want it to be the other way. Use spread and gather!

From the RStudio blog:

library(tidyr)
library(dplyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
messy
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50

# We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:

# gather(data, key, value, ..., na.rm = FALSE, convert = FALSE,
#   factor_key = FALSE)

messy %>%
  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50

Conversely, you also might want to spread your data.

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE,
  sep = NULL)
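
For example, spreading the gathered drug/heartrate data from above puts it back in its original wide shape (reusing the messy data frame defined earlier):

messy %>%
  gather(drug, heartrate, a:b) %>%
  spread(drug, heartrate) # back to one column each for a and b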

Let's say you have one variable that combines two pieces of information (like date-time) and you want one variable for each piece. That's what separate() does.

In the case of the vignette, they wanted to split a key column that combined location and time (randomized variables).

tidier <- messy %>%
  gather(key, time, -id, -trt)
tidier %>% head(8)
#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774

tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = "\\.")
tidy %>% head(8)
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774

The opposite of separate() is unite().
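
A minimal unite() sketch, assuming a hypothetical df with separate year, month, and day columns:

df %>%
  unite(date, year, month, day, sep = "-") # e.g. "2018-06-01"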

Let's say I have a dataframe where one of the variables is a full address. I want only the rows from Minnesota.

Found in a SO question

library(stringr) # str_detect() comes from stringr

df %>%
    filter(str_detect(address, "Minnesota"))
# "detects" the string "Minnesota" in the address column

What if we have NA values and blank cells, both meant to be NA?

Answer found in a SO question

df <- df[!(is.na(df$var) | df$var==""), ]
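
The same idea with dplyr, if you prefer pipes (same hypothetical df and var):

df %>%
  filter(!is.na(var), var != "")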

Janitor

janitor is an awesome data cleaning package. As it says on the GitHub site, it was built with beginners in mind, but it can be useful for more experienced R users by making routine processes a little faster. I now routinely add clean_names() every time I load a CSV.

read_csv("data.csv") %>%
  clean_names()

In fact, when I'm loading lots of csvs I'll often write my own function that does that automatically. Something like:

clean_read <- function(data) {
  x <- read_csv(data) %>%
    clean_names()
  return(x)
}

If you're dealing with messy data, especially hard coded data, you might see a lot of duplicates. In that case get_dupes() from janitor will be your new best friend!
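
A quick sketch of what it does, using mtcars just for illustration:

library(janitor)

mtcars %>%
  get_dupes(cyl, gear) # rows that share a cyl + gear combination, with a dupe_count column added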

Modeling in R

General format for linear models in R: lm(response ~ explanatory, data = data)

  • Include all variables as explanatory variables: lm(response ~ ., data = data)
  • Include all but one (or a few) variables as explanatory variables: lm(response ~ . - vars_to_exclude, data = data)
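
With mtcars as a stand-in, those patterns look like:

mod_all  <- lm(mpg ~ ., data = mtcars)           # mpg explained by every other column
mod_most <- lm(mpg ~ . - wt - am, data = mtcars) # everything except wt and am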

Linear model assumptions

  • Errors are normally distributed
  • Errors have mean zero
  • Errors have constant variance
    • A common fix for non-constant variance is log(response)
  • Error terms are independent

David Robinson's broom package gives model output in a "tidy" format.

The tidy() function gives the typical model summary as a tidy data frame, which is great for further analysis!

> library(fivethirtyeight)
> data("hate_crimes")
> library(broom)
> mod <- lm(hate_crimes_per_100k_splc ~ gini_index, data = hate_crimes)
> tidy(mod)
         term  estimate std.error statistic    p.value
1 (Intercept) -1.527463 0.7833043 -1.950025 0.05741966
2  gini_index  4.020510 1.7177215  2.340606 0.02374447

Then instead of having to look for the fitted values in the nested structure of mod, you can get them + more in a tidy data frame with augment()!

> head(augment(mod))
  .rownames hate_crimes_per_100k_splc gini_index   .fitted    .se.fit      .resid       .hat    .sigma      .cooksd .std.resid
1         1                0.12583893      0.472 0.3702175 0.04512572 -0.24437860 0.03499099 0.2410643 0.0192795907 -1.0312203
2         2                0.14374012      0.422 0.1691920 0.06752783 -0.02545191 0.07835622 0.2439314 0.0005134118 -0.1098987
3         3                0.22531995      0.455 0.3018689 0.03520103 -0.07654890 0.02129211 0.2436851 0.0011190988 -0.3207498
4         4                0.06906077      0.458 0.3139304 0.03543832 -0.24486962 0.02158013 0.2410927 0.0116131848 -1.0261863
5         5                0.25580536      0.471 0.3661970 0.04407070 -0.11039166 0.03337397 0.2433762 0.0037397389 -0.4654371
6         6                0.39052330      0.457 0.3099099 0.03527586  0.08061342 0.02138273 0.2436546 0.0012466081  0.3377963

Then you can easily add fitted values and residuals to plots that help assess model quality. Fewer steps is good!
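
For example, a residuals-vs-fitted plot comes almost for free from augment() (a quick sketch using the mod fit above):

library(ggplot2)

ggplot(augment(mod), aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") # look for even scatter around zero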

Data

General Repositories

Gov open data

Popular press data

Curated lists

Civic tech projects

  • Open Elections - creating a central database of US election data at the precinct level. Uses Python, plus there are non-coding roles.
  • Mapping Prejudice - digitizing and mapping racial covenants in Minneapolis home deeds. No coding required!

Open source contributing

General tips + tricks


  • Python

  • HTML/CSS

  • Regex

  • SQL

  • Jekyll + blogging
