Analyzing download speeds in Kentucky counties

In this tutorial I will show how to:

  • Download the Ookla open dataset
  • Geocode the tiles to Kentucky counties
  • Make a table of the top and bottom 20 counties by download speed
  • Map the counties

There are two main ways to join these tiles to another geographic dataset: quadkeys and spatial joins. This tutorial will use the spatial join approach.
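For reference, here is a minimal sketch of the quadkey alternative (not used in this tutorial). The Ookla tiles carry a quadkey column, so a dataset that is already keyed on quadkeys at the same zoom level can be joined on that column without touching the geometry at all; other_data and its file name below are hypothetical.

# alternative approach: attribute join on the quadkey column
# (other_data is a hypothetical table that already has a quadkey column
#  at the same zoom level as the Ookla tiles)
# other_data <- readr::read_csv("my_quadkey_dataset.csv")
# tiles_joined <- dplyr::inner_join(other_data, sf::st_set_geometry(tiles, NULL), by = "quadkey")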

library(tigris) # county boundaries
library(tidyverse) # data cleaning
library(sf) # spatial functions
library(knitr)
library(kableExtra) # county statistics table
library(RColorBrewer) # colors
library(here) # file management
library(usethis) # download data

Download data

First, download the data to a local directory by uncommenting the use_zip() line below.

# download the zip folder from s3 and save to working directory
# use_zip("https://ookla-open-data.s3-us-west-2.amazonaws.com/shapefiles/performance/type%3Dfixed/year%3D2020/quarter%3D2/2020-04-01_performance_fixed_tiles.zip")

# read the shapefile
tiles <- read_sf(here("2020-04-01_performance_fixed_tiles/gps_fixed_tiles.shp")) %>%
  mutate(avg_d_kbps = as.numeric(avg_d_kbps),
         avg_u_kbps = as.numeric(avg_u_kbps),
         avg_lat_ms = as.numeric(avg_lat_ms))
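A quick, optional sanity check that the shapefile read in as expected:

# optional sanity check on the tiles
nrow(tiles)    # number of tiles read
glimpse(tiles) # column types, including the speed fields converted above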

Get county boundaries

Then, I’ll load the Kentucky county boundaries from the U.S. Census Bureau via tigris.

ky_counties <- tigris::counties(state = "Kentucky") %>%
  select(state_code = STATEFP, geoid = GEOID, name = NAME) %>% # only keep useful variables 
  st_transform(4326) # transform to the same CRS as the tiles
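Before joining, it's worth a quick (optional) check that the two layers now share the same CRS:

# both layers should now be in EPSG:4326
st_crs(ky_counties) == st_crs(tiles)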

Join tiles to counties

Now I’ll join the tiles to the counties. I use left = FALSE because I only want to include counties that have at least 1 tile.

tiles_in_ky_counties <- st_join(ky_counties, tiles, left = FALSE)
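If you want to eyeball the result before summarising, counting tiles per county is a quick way to confirm the join behaved sensibly:

# optional: tiles per county, largest first
tiles_in_ky_counties %>%
  st_set_geometry(NULL) %>%
  count(name, sort = TRUE) %>%
  head()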

Calculate statistics

Once the datasets are joined, we are interested in summary statistics at the county level. Since we know the average download speed per tile, we could simply average those tile values, but then a tile with a handful of tests would count as much as a tile with thousands. Instead I'll use a weighted mean, weighted by test count, which recovers the overall county mean we would have gotten if the data hadn't been aggregated to tiles first. I've included weighted means for upload speed and latency as well.
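To see why the weighting matters, here is a toy example with made-up numbers: a fast tile with many tests and a slow tile with few. The simple mean of the tile averages is dragged down by the lightly-tested tile, while the weighted mean reflects where the tests actually happened.

# toy illustration with made-up numbers
toy <- tibble(avg_d_kbps = c(100000, 20000), # a 100 Mbps tile and a 20 Mbps tile
              tests = c(90, 10))
mean(toy$avg_d_kbps) / 1000                     # simple mean: 60 Mbps
weighted.mean(toy$avg_d_kbps, toy$tests) / 1000 # weighted mean: 92 Mbps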

county_stats <- tiles_in_ky_counties %>%
  st_set_geometry(NULL) %>%
  group_by(state_code, geoid, name) %>%
  summarise(mean_dl_mbps_wt = weighted.mean(avg_d_kbps, tests) / 1000,
            mean_ul_mbps_wt = weighted.mean(avg_u_kbps, tests) / 1000,
            mean_lat_ms_wt = weighted.mean(avg_lat_ms, tests),
            tests = sum(tests)) %>%
  ungroup() %>%
  left_join(fips_codes %>% 
              mutate(geoid = paste0(state_code, county_code)) %>% 
              # get nicer county and state names
              select(state, geoid, long_name = county, county), by = c("geoid")) 

Make a table of the top 20 and bottom 20 counties

Next we can make a summary table of just the best and worst counties. We’ll require that counties have at least 50 tests so that the averages are more reliable. I use kable() here for simplicity, but you could use any of the R packages that help with tables.

table_data <- county_stats %>%
  filter(tests >= 50) %>%
  mutate(rank = min_rank(-mean_dl_mbps_wt)) %>% # rank in descending order
  dplyr::filter(rank <= 20 | rank >= n() - 19) %>%
  mutate(`County` = paste0(long_name, ", ", state),
         mean_dl_mbps_wt = round(mean_dl_mbps_wt, 2)) %>%
  arrange(rank) %>%
  select(`County`, `Average download speed (Mbps)` = mean_dl_mbps_wt, `Tests` = tests, `Rank` = rank)

kable(table_data, format.args = list(big.mark = ","))

| County | Average download speed (Mbps) | Tests | Rank |
|:---|---:|---:|---:|
| Jefferson County, KY | 159.02 | 72,699 | 1 |
| Fayette County, KY | 154.13 | 50,730 | 2 |
| Scott County, KY | 151.21 | 6,632 | 3 |
| Martin County, KY | 150.30 | 1,489 | 4 |
| Christian County, KY | 146.73 | 5,365 | 5 |
| Oldham County, KY | 144.92 | 9,787 | 6 |
| Gallatin County, KY | 144.21 | 673 | 7 |
| Madison County, KY | 144.17 | 6,097 | 8 |
| Clark County, KY | 144.07 | 2,525 | 9 |
| Bell County, KY | 143.69 | 1,165 | 10 |
| Pendleton County, KY | 141.33 | 633 | 11 |
| Warren County, KY | 141.17 | 11,759 | 12 |
| Grant County, KY | 140.04 | 1,355 | 13 |
| Woodford County, KY | 138.50 | 5,933 | 14 |
| Kenton County, KY | 137.51 | 13,883 | 15 |
| McCreary County, KY | 137.23 | 634 | 16 |
| Boone County, KY | 135.33 | 15,704 | 17 |
| Campbell County, KY | 134.47 | 7,576 | 18 |
| Bullitt County, KY | 132.58 | 8,526 | 19 |
| Spencer County, KY | 132.34 | 1,872 | 20 |
| Rockcastle County, KY | 39.15 | 1,083 | 97 |
| Monroe County, KY | 37.63 | 681 | 98 |
| Butler County, KY | 37.07 | 407 | 99 |
| Green County, KY | 36.11 | 915 | 100 |
| Perry County, KY | 35.74 | 4,348 | 101 |
| Livingston County, KY | 34.60 | 728 | 102 |
| Garrard County, KY | 34.03 | 2,252 | 103 |
| Wayne County, KY | 32.78 | 2,659 | 104 |
| Breckinridge County, KY | 31.20 | 1,500 | 105 |
| Harlan County, KY | 30.28 | 2,024 | 106 |
| Leslie County, KY | 29.07 | 1,303 | 107 |
| Lyon County, KY | 28.46 | 1,042 | 108 |
| Knott County, KY | 27.19 | 2,426 | 109 |
| Letcher County, KY | 26.68 | 3,061 | 110 |
| Casey County, KY | 26.28 | 727 | 111 |
| Bracken County, KY | 24.30 | 535 | 112 |
| Breathitt County, KY | 23.16 | 1,197 | 113 |
| Lee County, KY | 22.90 | 130 | 114 |
| Hickman County, KY | 20.42 | 192 | 115 |
| Hancock County, KY | 12.35 | 359 | 116 |

Map the counties

The table is good for a quick glance at overall patterns (what are the overall maxima and minima? where is the fastest speed?) but unless you’re already familiar with these areas it can be hard to picture where they are on a map. To go along with the table we can produce a quick choropleth map that will help give a more visual representation.

We can join our county statistics table back to the county boundaries (remember, we dropped the geometry when we calculated the statistics). I'm also binning the continuous download speed into a categorical variable, because people aren't great at reading continuous color schemes; discrete legends are much easier to read, with roughly seven categories at most (though this depends on the situation).
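As a quick illustration of the binning (with made-up speeds), cut() turns continuous values into ordered intervals using the same breaks used in the map code below:

# made-up speeds, binned with the breaks used for the map
cut(c(12, 48, 130, 180), breaks = c(0, 25, 50, 100, 150, 200), ordered_result = TRUE)
## [1] (0,25]    (25,50]   (100,150] (150,200]
## Levels: (0,25] < (25,50] < (50,100] < (100,150] < (150,200]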

One thing that helps people orient themselves on a map is including major place names. The {tigris} package makes it fairly easy to get a quick list!

set.seed(1) # get the same random sample each time
ky_places <- places(state = "Kentucky") %>%
  filter(PCICBSA == "Y") %>% # principal cities only
  st_centroid() %>%
  mutate(NAME = if_else(NAME == "Louisville/Jefferson County metro government (balance)", "Louisville", NAME)) %>% # shorten the name for Louisville
  sample_n(15) # just take a random sample of 15 places
county_stats_sf <- ky_counties %>%
  select(geoid) %>%
  left_join(county_stats %>% mutate(geoid = as.character(geoid)), by = c("geoid")) %>%
  mutate(mean_dl_mbps_wt = case_when(tests < 50 ~ NA_real_,
                            TRUE ~ mean_dl_mbps_wt)) %>% # at least 50 tests
  mutate(dl_cat = cut(mean_dl_mbps_wt, c(0, 25, 50, 100, 150, 200), ordered_result = TRUE))

ggplot() +
  geom_sf(data = county_stats_sf, aes(fill = dl_cat), color = "gray20", lwd = 0.1) +
  geom_sf_text(data = ky_places, aes(label = NAME), color = "black", size = 3) +
  theme_void() +
  scale_fill_manual(values = brewer.pal(n = 6, name = "BuPu"),  
                    na.value = "gray80", 
                    labels = c("0 to 25", "25.1 to 50", "50.1 to 100", "100.1 to 150", "150.1 to 200", "Insufficient data"), 
                    name = "Mean download speed (Mbps)", 
                    guide = guide_legend(direction = "horizontal", title.position = "top", nrow = 1, label.position = "bottom", keyheight = 0.5, keywidth = 5)) +
  theme(text = element_text(color = "gray25"),
        legend.position = "top")
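If you want to keep a copy of the map, ggsave() writes the last plot to disk; the file name and dimensions here are just examples.

# optionally save the map (file name and size are illustrative)
ggsave(here("ky_county_download_speed.png"), width = 9, height = 6, dpi = 300)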

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] usethis_1.6.1      here_0.1           RColorBrewer_1.1-2 kableExtra_1.1.0  
##  [5] knitr_1.29         sf_0.8-0           forcats_0.5.0      stringr_1.4.0     
##  [9] dplyr_1.0.2        purrr_0.3.4        readr_1.3.1        tidyr_1.1.0       
## [13] tibble_3.0.1       ggplot2_3.3.2      tidyverse_1.3.0    tigris_1.0        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2         jsonlite_1.7.0     viridisLite_0.3.0  modelr_0.1.8      
##  [5] assertthat_0.2.1   highr_0.8          sp_1.3-2           blob_1.2.1        
##  [9] cellranger_1.1.0   yaml_2.2.1         pillar_1.4.4       backports_1.1.8   
## [13] lattice_0.20-38    glue_1.4.1         uuid_0.1-2         digest_0.6.25     
## [17] rvest_0.3.5        colorspace_1.4-1   htmltools_0.5.0    pkgconfig_2.0.3   
## [21] broom_0.5.6        haven_2.3.1        scales_1.1.1       webshot_0.5.1     
## [25] farver_2.0.3       generics_0.0.2     ellipsis_0.3.1     withr_2.2.0       
## [29] cli_2.0.2          magrittr_1.5       crayon_1.3.4       readxl_1.3.1      
## [33] maptools_0.9-8     evaluate_0.14      fs_1.4.2           fansi_0.4.1       
## [37] nlme_3.1-140       xml2_1.3.2         foreign_0.8-71     class_7.3-15      
## [41] tools_3.6.1        hms_0.5.3          lifecycle_0.2.0    munsell_0.5.0     
## [45] reprex_0.3.0       compiler_3.6.1     e1071_1.7-3        rlang_0.4.7       
## [49] classInt_0.4-2     units_0.6-5        grid_3.6.1         rstudioapi_0.11   
## [53] rappdirs_0.3.1     rmarkdown_2.3      gtable_0.3.0       DBI_1.1.0         
## [57] curl_4.3           R6_2.4.1           lubridate_1.7.9    rgdal_1.4-6       
## [61] rprojroot_1.3-2    KernSmooth_2.23-15 stringi_1.4.6      Rcpp_1.0.3        
## [65] vctrs_0.3.4        dbplyr_1.4.4       tidyselect_1.1.0   xfun_0.15