NCAA Men's Volleyball Rosters

November 13, 2020
rbokeh tidyverse shiny NCAA rosters rvest tidygeocoder

After a bit of a hiatus, I’m back with a project I’ve been working on recently: accumulating men’s volleyball rosters from team sites. This exercise has been a good introduction to web scraping with the {rvest} package, as well as to data cleaning, particularly with the unite() and separate() functions from {tidyr} and with string tools such as str_squish() from {stringr} and base R’s gsub(). Along the way, I got curious about plotting points on a map to visualize where college men’s volleyball players come from. With the help of the {tidygeocoder} package, I was able to match latitudes and longitudes to hometowns, and I then stumbled across the {rbokeh} package for interactive map plotting. I had wanted to share a Shiny App built on publicly available data for some time, and this felt like the right opportunity.

While this is not an exhaustive list of rosters by any means, I was happy with the number of rosters (NCAA DI & DII to start) I was able to pull from team websites to get this off the ground. If anyone has advice or recommendations on expanding this roster list into a full-fledged database, I believe it would be worthwhile to maintain it to track the growth of boys’/men’s volleyball in the US.

The Data

Gathering and cleaning the data was (is) certainly the most tedious portion of this project, but it gave me some good reps using {rvest} and learning how to work with strings using {stringr}. I’ll go through an example of pulling one team’s rosters from their website for anyone just starting to get into web scraping (like myself!). We’ll start with UCLA.

First, we need to look at the URL of UCLA’s Men’s Volleyball roster page and see whether each year’s roster follows a pattern. The most recent roster (2020 at the time of writing) is found at https://uclabruins.com/sports/mens-volleyball/roster/.

From that page, we can select rosters going back to 2010. Selecting a previous year shows that the URL pattern is simply the year appended to the base URL, such as https://uclabruins.com/sports/mens-volleyball/roster/2018. Let’s scrape UCLA’s rosters with this information.

# load necessary packages
library(tidyverse)
library(rvest)

# years to pull from
years <- 2010:2020

# create a vector of urls to scrape rosters from
urls <- purrr::map_chr(years, ~ paste0("https://uclabruins.com/sports/mens-volleyball/roster/",.x))

# iterate scraping rosters over the urls and combine them as a data frame
ucla <- purrr::map_dfr(urls,
                       
                       # use functions from rvest/xml2 to read in the html
                       ~ read_html(.x) %>%
  
                         # identify html nodes with the <table> tag
                         html_nodes("table") %>%
                         
                         # from the <table> nodes we gathered, the third item contains
                         # the table with student-athlete rosters we want
                         .[[3]] %>%
                         
                         # parse the html table into a data frame
                         html_table(),
                       
                       # create a column named `year_index` to label each table
                       .id = "year_index") %>%
  
  # convert the `year_index` id label to an integer
  dplyr::mutate(year_index = as.integer(year_index),
                
                # match the year of the roster to `year_index`
                year = years[year_index],
                school = "UCLA",
                school_code = "ucla")

# take a look at what we scraped!
ucla %>% glimpse()
## Rows: 217
## Columns: 13
## $ year_index               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ `#`                      <int> 1, 2, 3, 4, 5, 7, 8, 10, 14, 15, 16, 17, 1...
## $ `Full Name`              <chr> "Cooper O'Connor", "Mitchel Johnson", "Tom...
## $ Pos                      <chr> "S", "OH", "L", "OH", "OH/L", "OH", "S", "...
## $ Ht.                      <chr> "6-2", "6-6", "6-0", "6-5", "6-5", "6-5", ...
## $ `Academic Year`          <chr> "R-Sr.", "So.", "Jr.", "Fr.", "R-Jr.", "Sr...
## $ `Hometown / High School` <chr> "Long Beach, CA / Los Alamitos", "Manhatta...
## $ Pos.                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Wt.                      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Yr.                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ year                     <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...
## $ school                   <chr> "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "U...
## $ school_code              <chr> "ucla", "ucla", "ucla", "ucla", "ucla", "u...

Check out https://github.com/tidyverse/rvest for more information on using rvest for web scraping.
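
One fragile spot in the scraper above is the hard-coded `.[[3]]`: it only works while the roster happens to be the third table on the page. Here is a minimal sketch of an alternative that picks the table by one of its column headers instead. The helper name `find_roster_table()` and the toy page (markup and players) are made up for illustration, and I'm assuming the roster table always has a "Full Name" column, which may not hold for every school or year.

```r
library(rvest)
library(purrr)

# hypothetical helper: return the first table on a parsed page whose
# headers include `required_col`, instead of relying on table position
find_roster_table <- function(page, required_col = "Full Name") {
  tables <- page %>%
    html_nodes("table") %>%
    html_table()
  detect(tables, ~ required_col %in% names(.x))
}

# toy page standing in for a real roster page (made-up markup and players)
toy_page <- read_html('
  <table><tr><th>Date</th><th>Opponent</th></tr>
         <tr><td>Jan 3</td><td>Example U</td></tr></table>
  <table><tr><th>#</th><th>Full Name</th><th>Pos</th></tr>
         <tr><td>1</td><td>Player One</td><td>S</td></tr>
         <tr><td>2</td><td>Player Two</td><td>OH</td></tr></table>')

# the schedule table is skipped and the roster table is returned
roster <- find_roster_table(toy_page)
```

In the UCLA pipeline, something like `~ read_html(.x) %>% find_roster_table()` could stand in for the `html_nodes()`, `.[[3]]`, and `html_table()` steps, at the cost of depending on the header text staying the same.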

A quick glimpse at our new data frame shows what we were able to scrape from these urls. Let’s clean this up a bit with the relevant pieces of information we’ll want to show in our map.

ucla %>%
  
  # remove columns we don't need going forward
  dplyr::select(-year_index,-`#`,-`Wt.`,-`Yr.`,-`Academic Year`) %>%
  
  # combine values for Pos and Pos. into one column
  tidyr::unite(c(Pos,`Pos.`),      # columns to unite
               col = "position",   # new column name
               na.rm = TRUE) %>%   # prevent NA values from combining
  
  # separate hometown and high school into two columns
  tidyr::separate(col = `Hometown / High School`,    # select column to separate
                  into = c("hometown","highschool"), # new column names to separate into
                  sep = "/") %>%                     # separator between columns
  
  # separate hometown into hometown and state
  tidyr::separate(col = `hometown`,
                  into = c("hometown","state"),
                  sep = ",") %>%
  
  # rename columns
  dplyr::rename(name = `Full Name`,
                height = `Ht.`) %>%
  
  # clean up character values with extraneous spaces with stringr::str_squish then assign to `ucla`
  dplyr::mutate(across(.cols = c(name,position,hometown,state,highschool),
                       .fns = str_squish)) -> ucla

# take a look
ucla %>% glimpse()
## Rows: 217
## Columns: 9
## $ name        <chr> "Cooper O'Connor", "Mitchel Johnson", "Tom Hastings", "...
## $ position    <chr> "S", "OH", "L", "OH", "OH/L", "OH", "S", "QH", "Opp", "...
## $ height      <chr> "6-2", "6-6", "6-0", "6-5", "6-5", "6-5", "6-8", "6-5",...
## $ hometown    <chr> "Long Beach", "Manhattan Beach", "Yorba Linda", "Zichro...
## $ state       <chr> "CA", "CA", "CA", "Israel", "CA", "CA", "CA", "CA", "IL...
## $ highschool  <chr> "Los Alamitos", "Mira Costa", "Esperanza", "Hof Hashron...
## $ year        <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2...
## $ school      <chr> "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "UCLA", "UCLA",...
## $ school_code <chr> "ucla", "ucla", "ucla", "ucla", "ucla", "ucla", "ucla",...
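
To make the cleaning steps easier to follow, here is the same unite()/separate()/str_squish() sequence on a tiny made-up data frame. The rows are invented for illustration. The `fill = "right"` and `extra = "merge"` arguments are my own addition (they are not in the pipeline above) to handle rows with a missing high school or an extra separator without warnings.

```r
library(dplyr)
library(tidyr)
library(stringr)

# toy data mimicking the scraped columns (made-up rows)
toy <- tibble(
  Pos = c("S", NA, "OH"),
  `Pos.` = c(NA, "L", NA),
  `Hometown / High School` = c(
    "Long Beach, CA / Los Alamitos",
    "Tel Aviv, Israel  /  Example HS",
    "Honolulu, HI"                      # no high school listed
  )
)

toy_clean <- toy %>%
  # whichever of Pos / Pos. is not NA becomes the position
  unite(col = "position", c(Pos, `Pos.`), na.rm = TRUE) %>%
  # split on "/"; a missing high school becomes NA
  separate(col = `Hometown / High School`,
           into = c("hometown", "highschool"),
           sep = "/", fill = "right", extra = "merge") %>%
  # split on ","; the second piece is a state or a country
  separate(col = hometown,
           into = c("hometown", "state"),
           sep = ",", fill = "right", extra = "merge") %>%
  # trim and collapse stray whitespace left over from the splits
  mutate(across(c(hometown, state, highschool), str_squish))
```

Because each year's table uses either `Pos` or `Pos.` as its header, binding the years together leaves one of the two columns NA in every row, which is why `na.rm = TRUE` yields a single clean position.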

While I won’t go through it here, I did spend some more time cleaning up states and positions, adding a country variable, and using the {tidygeocoder} package to obtain latitude and longitude data for each city. The most up-to-date version of the clean roster can be found on my GitHub page.
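
For anyone curious, here is a rough sketch of what the country and geocoding steps could look like. This is not the exact code behind the published data set. It assumes that any `state` value that is not a US state abbreviation (or "DC") is really a country name, and the `address` column and the `geocode_hometowns()` helper are names I made up for this example. The geocoding itself uses tidygeocoder::geocode() with the OpenStreetMap ("osm") service, which needs an internet connection, so it is wrapped in a function and not run here.

```r
library(dplyr)

# toy roster after the cleaning steps (made-up rows)
toy_roster <- tibble(
  name = c("Player One", "Player Two"),
  hometown = c("Long Beach", "Tel Aviv"),
  state = c("CA", "Israel")
)

# assumption: anything outside this list is a country, not a state
us_states <- c(state.abb, "DC")

toy_located <- toy_roster %>%
  mutate(
    country = if_else(state %in% us_states, "USA", state),
    state = if_else(state %in% us_states, state, NA_character_),
    # single-line address string to hand to the geocoder
    address = if_else(is.na(state),
                      paste(hometown, country, sep = ", "),
                      paste(hometown, state, country, sep = ", "))
  )

# hypothetical helper: geocode each distinct address once, then join the
# coordinates back onto the roster (requires {tidygeocoder} and a connection)
geocode_hometowns <- function(roster) {
  coords <- roster %>%
    distinct(address) %>%
    tidygeocoder::geocode(address = address, method = "osm",
                          lat = lat, long = long)
  left_join(roster, coords, by = "address")
}
```

Calling `geocode_hometowns(toy_located)` would add `lat` and `long` columns. Geocoding distinct addresses first keeps the number of requests down, since many players share a hometown.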

Shiny App

Now that we have some clean data with latitudes and longitudes, let’s build an interactive Shiny App to see where NCAA Men’s Volleyball athletes are coming from. RStudio’s Shiny website is a great resource if this is your first go at writing a Shiny App, and its gallery contains a lot of great examples and code to help you get started. Again, this app can be found on my GitHub page as linked above. I’ll go through the code with comments here.

# load necessary packages
library(tidyverse)
library(rbokeh)   # interactive map plotting
library(shiny)

# read in the data
data0 <- readr::read_rds("ncaa_roster.rds")

# Define UI for application that draws a map
ui <- fluidPage(
    
  # add title
  titlePanel(
    "NCAA Men's Volleyball - Division I-II Historical Rosters"
    
  ), # close titlePanel
  
  sidebarLayout(
    
    # add a sidebarPanel to house an update button and filters
    sidebarPanel(
      width = 2,
      
      # create the update button
      actionButton(inputId = "update",
                   label = "Update"),
      
      # add some white space between the bottom of the button and the first filter
      br(),
      br(),
      
      # selector field for year
      selectizeInput(
        
        # inputId to refer back to in the server function below
        inputId = "year",
        
        # label as seen when the app is run
        label = "Year",
        
        # define available choices to filter by
        choices = unique(
          data0 %>%
            dplyr::arrange(desc(year)) %>%
            dplyr::pull(year)),
        selected = 2021,
        
        # allow multiple years to be selected
        multiple = TRUE
        
      ), # close selectizeInput
      
      # selector field for school
      selectizeInput(
        inputId = "school",
        label = "School",
        choices = unique(
          data0 %>%
            dplyr::arrange(school) %>%
            dplyr::pull(school)),
        multiple = TRUE
        
      ), # close selectizeInput
      
      # selector field for All-American status
      selectizeInput(
        inputId = "all_american",
        label = "All-American",
        choices = unique(
          data0 %>%
            dplyr::arrange(all_american) %>%
            dplyr::pull(all_american)),
        multiple = TRUE
        
      ), # close selectizeInput
      
      # selector field for All-Conference status
      selectizeInput(
        inputId = "all_conference",
        label = "All-Conference",
        choices = unique(
          data0 %>%
            dplyr::arrange(all_conference) %>%
            dplyr::pull(all_conference)),
        multiple = TRUE
        
      ), # close selectizeInput
      
      # selector field for home state
      selectizeInput(
        inputId = "state",
        label = "Home State",
        choices = unique(
          data0 %>%
            dplyr::arrange(state) %>%
            dplyr::pull(state)),
        multiple = TRUE
        
      ), # close selectizeInput
      
      # selector field for home country
      selectizeInput(
        inputId = "country",
        label = "Home Country",
        choices = unique(
          data0 %>%
            dplyr::arrange(country) %>%
            dplyr::pull(country)),
        multiple = TRUE
        
      ) # close selectizeInput
      
    ), # close sidebarPanel
    
    # define the main panel (where the map will go)
    mainPanel(
      width = 10,
      
      # call the map from the server function below and define map size
      rbokehOutput(
        outputId = "map",
        width = 1000,
        height = 1000
        
      ) # close rbokehOutput
    ) # close mainPanel
  ) # close sidebarLayout
) # close fluidPage

# Define server logic required to draw map
server <- function(input, output) {
  
  # define map output
  output$map <- renderRbokeh({
    
    # trigger update via update action button
    input$update
    
    # define map figure
    rbokeh::figure(
      width = 1440,
      height = 900) %>%
      ly_map("world",col = "gray") %>%
      ly_points(long, lat, 
                data = data0 %>%
                  
                  # filter by given inputs from ui function above
                  # isolate() prevents the output from reloading until the
                  # update actionbutton is triggered
                  dplyr::filter(year %in% isolate(input$year) |
                                  
                                  # is.null to call all data with no filters
                                  is.null(isolate(input$year))) %>%
                  dplyr::filter(school %in% isolate(input$school) |
                                  is.null(isolate(input$school))) %>%
                  dplyr::filter(all_american %in% isolate(input$all_american) |
                                  is.null(isolate(input$all_american))) %>%
                  dplyr::filter(all_conference %in% isolate(input$all_conference) |
                                  is.null(isolate(input$all_conference))) %>%
                  dplyr::filter(state %in% isolate(input$state) |
                                  is.null(isolate(input$state))) %>%
                  dplyr::filter(country %in% isolate(input$country) |
                                  is.null(isolate(input$country))) %>%
                  
                  # group and summarise to prevent duplicate names over multiple years
                  dplyr::group_by(name,hometown,state,country,school,lat,long) %>%
                  dplyr::summarise(.groups = "drop"),
                hover = c(name, school, hometown, state, country)) %>%
      x_axis(label = "", grid = FALSE, visible = FALSE) %>%
      y_axis(label = "", grid = FALSE, visible = FALSE)
    
  })
}

# Run the application 
shinyApp(ui = ui, server = server)
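
The filter logic in the server function is easier to see outside of Shiny. A selectizeInput with `multiple = TRUE` returns NULL when nothing is selected, so each filter keeps a row if its value is among the selections or if there are no selections at all. Below is a small sketch of that idea on made-up data. The helpers `matches_selection()` and `filter_roster()` are hypothetical names that are not part of the app, and I use distinct() here as a shorter stand-in for the group_by()/summarise() step.

```r
library(dplyr)

# TRUE for every row when nothing is selected, otherwise TRUE only for matches
matches_selection <- function(values, selected) {
  if (is.null(selected)) rep(TRUE, length(values)) else values %in% selected
}

# toy roster with one player appearing in two seasons (made-up rows)
toy_roster <- tibble(
  name = c("Player One", "Player One", "Player Two"),
  school = c("UCLA", "UCLA", "Example U"),
  year = c(2019L, 2020L, 2020L)
)

# apply the optional filters, then keep one row per player
filter_roster <- function(data, years = NULL, schools = NULL) {
  data %>%
    filter(matches_selection(year, years),
           matches_selection(school, schools)) %>%
    distinct(name, school)
}

filter_roster(toy_roster)                     # no filters: every player, once
filter_roster(toy_roster, schools = "UCLA")   # only UCLA players
```

Inside the app, the equivalent call would look something like `dplyr::filter(matches_selection(year, isolate(input$year)))`, which behaves the same as the `%in% ... | is.null(...)` pairs above.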


The app is hosted on shinyapps.io at https://natengo1.shinyapps.io/ncaa_men_d1_d2/, or you can download the code and run it locally on your machine.

All in all, this was a fun side project that I plan to build upon and maintain by tracking rosters over time. At the time of writing, I have started accumulating some DIII rosters and will update the app to reflect those additions. I will happily accept any feedback and/or help adding more rosters going forward.
