For #tidytuesday we’re looking at Amusement Park injuries. I plan on making a simple visual of the number of injuries by month.

Libraries

if (!require(pacman)) {install.packages('pacman')}
# Namespace the call so it also works right after a fresh install of pacman
pacman::p_load(janitor, skimr, stringr, tidyverse, lubridate)

Import Data

Previous inspection of the raw data shows that some NA values are denoted by other strings such as “n/a” or “#########”. These are not picked up as NA by the default settings, so they must be listed manually.

#Split over multiple lines for legibility
data_url <- paste0("https://raw.githubusercontent.com/rfordatascience/",
                   "tidytuesday/master/data/2019/2019-09-10/",
                   "tx_injuries.csv")

#Define observed N/A types
na_list <- c("NA", "n/a", "#########", "N/A", "na")

#Import Data
tx_injuries <- readr::read_csv(file = data_url, na = na_list)
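
To see what the na argument changes, here is a minimal sketch on a made-up inline CSV. The column names and values are hypothetical and not taken from tx_injuries.csv, and the sketch assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Hypothetical inline data that mimics the placeholder strings seen in the raw file
toy_csv <- I("ride,age\nCoaster,34\nSlide,n/a\nCarousel,#########\nBumper Cars,N/A\n")

na_list <- c("NA", "n/a", "#########", "N/A", "na")

# Default NA handling: the placeholders survive, so `age` is read as text
default_read <- read_csv(toy_csv, show_col_types = FALSE)

# Custom NA list: the placeholders become NA and `age` parses as a number
clean_read <- read_csv(toy_csv, na = na_list, show_col_types = FALSE)

class(default_read$age)
class(clean_read$age)
sum(is.na(clean_read$age))
```

Comparing the two column classes is a quick way to confirm that every placeholder string has been caught.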

Data Cleaning / Preparation

Date Correction

There are two date formats used in the data set. One uses an “M/D/Y” format. The other is an Excel serial number. Both are stored as character strings. To convert the dates to a consistent format and a date object, the following steps were taken.

  1. Drop all missing dates.
  2. Use if_else() to determine which date format is being processed, by checking whether the string contains a “/”.
  3. For the “M/D/Y” dates use the mdy() function from lubridate to convert to a date object. Save the date object in a new column using mutate.
  4. Convert the serial date strings to numbers with as.numeric(). Then use excel_numeric_to_date() from the janitor package to convert them to date objects. Save the result with mutate in the same new column as the dates from Step 3.
# Consolidate Date Types / Drop Missing Dates

tx_injuries <- tx_injuries %>%
  # Drop N/A Injury dates
  drop_na(injury_date) %>%
  # Unify date type
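  # if_else() evaluates both branches for every row, so parse/coercion
  # warnings for the non-matching format are expected here
  mutate(injury_date_conv = if_else(
    # Check if date uses "/"
    grepl(pattern = "/", x = injury_date),
    # Converts M/D/Y dates
    mdy(injury_date),
    # Converts serial dates (stored as strings, so cast to numeric first)
    excel_numeric_to_date(as.numeric(injury_date),
                          date_system = "modern")
    )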
  )
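
As a sanity check on the logic above, here is a minimal base R sketch on a made-up vector of mixed date strings (the values are hypothetical). Instead of if_else(), it converts each format only on the rows that match it, which avoids the coercion warnings. It assumes the serial numbers use the modern Excel date system, whose origin is 1899-12-30.

```r
# Hypothetical mix of "M/D/Y" strings and Excel serial numbers stored as text
raw_dates <- c("5/12/2013", "41406", "12/1/2014")

# Rows containing "/" are treated as M/D/Y, everything else as a serial number
is_mdy <- grepl("/", raw_dates, fixed = TRUE)

# Start with an all-NA Date vector and fill each subset separately
converted <- rep(as.Date(NA), length(raw_dates))
converted[is_mdy] <- as.Date(raw_dates[is_mdy], format = "%m/%d/%Y")
converted[!is_mdy] <- as.Date(as.numeric(raw_dates[!is_mdy]),
                              origin = "1899-12-30")

converted
```

The serial 41406 and the string “5/12/2013” refer to the same day, so the first two elements should match after conversion.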

Injuries By Month

With each injury date now stored as a date object in a new column, we count the number of injuries in each month by grouping on both year and month with group_by. For the final visual a dummy day column is added, with a value of 1. This day column is used to create another date object that marks the start of each month. To create it, the year, month, and day columns are concatenated into a single “Y-M-D” string, which is then converted into a date object using the ymd() function from lubridate.

# Data Frame Development
tx_injuries <- tx_injuries %>% 
  mutate(month = month(injury_date_conv),
         year = year(injury_date_conv)) %>% 
  group_by(year, month) %>% 
  summarise(injuries = n(), .groups = "drop") %>%
  mutate(day = 1,
         eff_date_char = paste(year,month,day, sep = "-"),
         eff_date = ymd(eff_date_char)) %>% 
  select(-eff_date_char)
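
The same monthly totals can also be reached without building a date string. The following is an alternative sketch, not the approach used in this post: it rounds each date down to the first of its month with lubridate's floor_date() and counts the rows per month. The toy data frame is hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical converted dates: two injuries in May 2013, one in June 2013
toy <- tibble(
  injury_date_conv = as.Date(c("2013-05-12", "2013-05-30", "2013-06-02"))
)

# floor_date() snaps each date to the first day of its month,
# and count() tallies the rows that share a month start
monthly <- toy %>%
  count(eff_date = floor_date(injury_date_conv, unit = "month"),
        name = "injuries")

monthly
```

This yields one row per month with the month start in eff_date, which is the same shape the plot needs.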

Visual

Now the injuries recorded each month can be plotted. There is a clear seasonal pattern, which probably tracks total park visits.

#Visual
ggplot(data = tx_injuries
       , mapping = aes( x = eff_date, y = injuries)) +
  geom_col(fill = "#1F618D", alpha = 0.75) +
  scale_x_date(
    date_labels = "%Y",
    # date_breaks is the documented argument for interval strings
    date_breaks = "1 year") +
  labs(title = "Number of Injuries at Amusement Parks, By Month"
       , caption = "Data by Data.world | #TidyTuesday") +
  ylab("Injuries") +
  xlab("Year") +
  theme_minimal() +
  theme(axis.text.x =  element_text(hjust=-1.6))