Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing Columns in Week 39 Olympiad Data: country_results_df missing total and rank #757

Closed
nicolasfoss opened this issue Sep 25, 2024 · 3 comments

Comments

@nicolasfoss
Copy link

nicolasfoss commented Sep 25, 2024

Hello, Jon:

First, thanks for all the work you put into these. I found a missing column in the country_results_df.

In your README, you show the following columns for country_results_df (see below), but columns 13-14 do not show up in the file we can import for #tidytuesday

country_results_df.csv

column_number variable class description
1 year integer Year of IMO
2 country character Participating country
3 team_size_all integer Participating contestants
4 team_size_male integer
5 team_size_female integer Female contestants
6 p1 integer Score on problem 1
7 p2 integer Score on problem 2
8 p3 integer Score on problem 3
9 p4 integer Score on problem 4
10 p5 integer Score on problem 5
11 p6 integer Score on problem 6
12 p7 integer Score on problem 7
13 total integer Total score on all problems
14 rank integer Country rank
15 awards_gold integer Number of gold medals
16 awards_silver integer Number of silver medals
17 awards_bronze integer Number of bronze medals
18 awards_honorable_mentions integer Number of honorable mentions
19 leader character Leader of country team
20 deputy_leader character Deputy leader of country team

You will notice that the total and rank columns are missing below.

# Load necessary libraries
library(tidyverse)
library(tidytuesdayR)


# Download the data for 2024-09-24 #TidyTuesday
tuesdata <- tidytuesdayR::tt_load('2024-09-24')

# Load the country_results_df dataset
country_results_df <- tuesdata$country_results_df

# Check the structure of the data
country_results_df %>% 
  glimpse()
#> Rows: 3,780
#> Columns: 18
#> $ year                      <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
#> $ country                   <chr> "United States of America", "People's Republ…
#> $ team_size_all             <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ team_size_male            <dbl> 5, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 5, 5, 5, 6,…
#> $ team_size_female          <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
#> $ p1                        <dbl> 42, 42, 42, 42, 42, 42, 42, 42, 42, 38, 42, …
#> $ p2                        <dbl> 41, 42, 37, 34, 30, 37, 33, 37, 25, 37, 33, …
#> $ p3                        <dbl> 19, 31, 18, 11, 10, 7, 8, 16, 5, 5, 6, 2, 3,…
#> $ p4                        <dbl> 40, 40, 42, 42, 42, 42, 42, 36, 42, 42, 42, …
#> $ p5                        <dbl> 35, 22, 7, 28, 36, 29, 31, 23, 35, 12, 23, 1…
#> $ p6                        <dbl> 15, 13, 22, 10, 5, 5, 6, 1, 2, 17, 3, 9, 3, …
#> $ p7                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ awards_gold               <dbl> 5, 5, 2, 4, 4, 1, 2, 2, 1, 2, 2, 1, 3, 1, 2,…
#> $ awards_silver             <dbl> 1, 1, 4, 1, 0, 5, 3, 3, 4, 2, 2, 4, 1, 3, 2,…
#> $ awards_bronze             <dbl> 0, 0, 0, 0, 2, 0, 1, 1, 1, 2, 2, 1, 2, 2, 1,…
#> $ awards_honorable_mentions <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ leader                    <chr> "John Berman", "Liang Xiao", "Suyoung Choi",…
#> $ deputy_leader             <chr> "Carl Schildkraut", "Yijun Yao", "Hwajong Yo…

Below is a snippet from the cleaning script in the README, where the total and rank columns are not selected within the select() statement.

### NOT RUN {
# country_results_df <- map_df(timeline_df$year, scrape_country) %>% 
#  select( 
#    year,
#    country,
#    team_size_all = team_size,
#    team_size_male = team_size_2,
#    team_size_female = team_size_3,
#    starts_with("p"), # <--- `total` and then `rank` are not included here
#    awards_gold = awards,
#    awards_silver = awards_2,
#    awards_bronze = awards_3,
#    awards_honorable_mentions = awards_4,
#    leader,
#    deputy_leader
#  ) %>% 
#  mutate(
#    across(
#      c(team_size_all:awards_honorable_mentions),
#      as.integer
#    )
#  )}

Here, I will propose a fix, it is simply an update to the select() statement:

# re-run the cleaning script

library(tidyverse)
library(rvest)
library(janitor)
library(httr2)
  
  timeline_df <- read_html("https://www.imo-official.org/organizers.aspx") %>%
    html_table() %>%
    .[[1]] %>%
    clean_names() %>%
    rename(
      "all_contestant" = contestants,
      "male_contestant" = contestants_2,
      "female_contestant" = contestants_3,
      "edition" = number
    ) %>%
    filter(edition != "#") %>%
    mutate(
      start_date = paste0(gsub("(.*)(-)(.*)", "\\1", date),year),
      end_date = paste0(gsub("(.*)(-)(.*)", "\\3", date),year),
      across(
        c(start_date, end_date),
        ~as.Date(.x, format = "%d.%m.%Y")
      ),
      across(
        c(edition, year, countries, all_contestant, male_contestant, female_contestant),
        as.integer
      )
    ) %>%
    select(-date) %>%
    # only keeping records till current year
    filter(year < 2025)
  
  # circulate through country results link and rbind tables
  scrape_country <- function(year) {
    paste0("https://www.imo-official.org/year_country_r.aspx?year=", year) %>%
      read_html() %>%
      html_table() %>%
      .[[1]] %>%
      clean_names() %>%
      filter(country != "Country") %>%
      mutate(year = year, .before = "country") 
  }

After the initial setup, we can manipulate the data and select total and rank along with the rest of the good stuff.

# try an updated select() statement
country_results_df <- map_df(timeline_df$year, scrape_country) %>%
  select(
    year,
    country,
    team_size_all = team_size,
    team_size_male = team_size_2,
    team_size_female = team_size_3,
    starts_with("p"),
    total, # <--- total included
    rank, # <--- rank included
    awards_gold = awards,
    awards_silver = awards_2,
    awards_bronze = awards_3,
    awards_honorable_mentions = awards_4,
    leader,
    deputy_leader
  ) %>% 
  mutate(
    across(
      c(team_size_all:awards_honorable_mentions),
      as.integer
    )
  )
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(c(team_size_all:awards_honorable_mentions),
#>   as.integer)`.
#> Caused by warning:
#> ! NAs introduced by coercion

We can see below that total and rank are included, and we are off to the...math competition!

# check

country_results_df %>% 
  glimpse() # total and rank are included
#> Rows: 3,780
#> Columns: 20
#> $ year                      <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
#> $ country                   <chr> "United States of America", "People's Republ…
#> $ team_size_all             <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ team_size_male            <int> 5, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 5, 5, 5, 6,…
#> $ team_size_female          <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
#> $ p1                        <int> 42, 42, 42, 42, 42, 42, 42, 42, 42, 38, 42, …
#> $ p2                        <int> 41, 42, 37, 34, 30, 37, 33, 37, 25, 37, 33, …
#> $ p3                        <int> 19, 31, 18, 11, 10, 7, 8, 16, 5, 5, 6, 2, 3,…
#> $ p4                        <int> 40, 40, 42, 42, 42, 42, 42, 36, 42, 42, 42, …
#> $ p5                        <int> 35, 22, 7, 28, 36, 29, 31, 23, 35, 12, 23, 1…
#> $ p6                        <int> 15, 13, 22, 10, 5, 5, 6, 1, 2, 17, 3, 9, 3, …
#> $ p7                        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ total                     <int> 192, 190, 168, 167, 165, 162, 162, 155, 151,…
#> $ rank                      <int> 1, 2, 3, 4, 5, 6, 6, 8, 9, 9, 11, 12, 13, 14…
#> $ awards_gold               <int> 5, 5, 2, 4, 4, 1, 2, 2, 1, 2, 2, 1, 3, 1, 2,…
#> $ awards_silver             <int> 1, 1, 4, 1, 0, 5, 3, 3, 4, 2, 2, 4, 1, 3, 2,…
#> $ awards_bronze             <int> 0, 0, 0, 0, 2, 0, 1, 1, 1, 2, 2, 1, 2, 2, 1,…
#> $ awards_honorable_mentions <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ leader                    <chr> "John Berman", "Liang Xiao", "Suyoung Choi",…
#> $ deputy_leader             <chr> "Carl Schildkraut", "Yijun Yao", "Hwajong Yo…

Thanks again for all your work on the #tidytuesday project, it is such a great opportunity to learn and grow with others.

Created on 2024-09-25 with reprex v2.1.1

@nicolasfoss nicolasfoss changed the title Missing Column in Week 39 Olympiad Data: country_results_df missing rank Missing Columns in Week 39 Olympiad Data: country_results_df missing total and rank Sep 25, 2024
@jonthegeek
Copy link
Collaborator

Since people have already been using the data for a couple days, I remove the column from the dictionary. Thanks for the catch, but changing things can break a lot of people's code, so we try to leave things alone as much as possible once a dataset is released!

@nicolasfoss
Copy link
Author

@jonthegeek Sounds good, I will just get that added on my end and use in my analysis.

@nicolasfoss
Copy link
Author

@jonthegeek Sorry to be a bother, but I did see you pushed a fix to the readme.md to remove rank, you might also remove total as I updated my issue a few times as I caught that one as well. I just want to be helpful and am thankful for your responsiveness. Have a good one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants