Homework 01: Data Visualization

Due Wednesday, September 15 at 11:59 pm

Goals

In this homework assignment, you will…

Getting started

Packages

We will use the tidyverse package for this assignment, along with the ggridges package for ridge plots. If you wish to use the viridis color palettes, you will need the viridis package as well.

library(tidyverse)
library(ggridges)
library(viridis) # optional

Data: The NC Courage

In this assignment, we will be working with data from the first three full seasons of the NC Courage, a highly successful National Women’s Soccer League (NWSL) team located near Duke in Cary, NC. The Courage moved to the Triangle from Western New York in 2017 and had three very successful first seasons, culminating in winning the 2019 NWSL championship game, which was held at their home stadium in Cary! (Data for this assignment was sourced from the nwslR package, and verified with the NC Courage website by Meredith Brown in a previous semester.)

courage <- read_csv("data/courage.csv")

The variables in the dataset are as follows:

Exercises

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

In addition, no line of code should exceed 80 characters, so that all the code can be read when you knit to PDF. See the Lab 02 instructions for how to add a margin line at column 80.
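One common way to stay within 80 characters is to start a new line after each %>% in a pipeline (and after each + in a ggplot call). The sketch below uses the built-in mtcars data, not the assignment data, and the object name mpg_summary is made up for illustration.

```r
library(tidyverse)

# Toy example on the built-in mtcars data (not the assignment data).
# Each step of the pipeline sits on its own line, so no line gets long.
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),  # average miles per gallon within each group
    n_cars = n()           # number of cars in each group
  )

mpg_summary
```

The same habit applies to plots: putting each layer on its own line keeps lines short and makes layers easy to comment out.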

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to knit, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

  1. How many rows are in the courage dataset? How many columns? Include the code and resulting output used to support your answer.

  2. Create a bar chart to visualize the distribution of the result of the games. Include a clear title and axis labels. What outcome occurred most frequently?

🧶 ✅ ⬆️ Now is a good time to knit, commit, and push.

  3. Now, let’s examine how the Courage performed in each individual season. Create a stacked bar plot showing the distribution of result within each season. You’re encouraged (but not required) to use the viridis color palette. Include a clear title and axis labels. What are 2-3 observations you have from the plot?

  4. Now let’s consider the distribution of points scored by the Courage in a game for all seasons. Make a histogram of the total number of points scored by the Courage in a game. Use the histogram to describe the distribution of points scored by the Courage.

    To get started, use the code below to create two new columns:

    • courage_points: the number of points scored by Courage in a game
    • courage_home: whether or not Courage was the home team (you will use this variable later on).
courage <- courage %>% 
  mutate(courage_points = if_else(home_team == "NC", home_pts, away_pts), 
         courage_home = if_else(home_team == "NC", "Home", "Away"))
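To see what if_else() does in the code above, it can help to run the same mutate() on a tiny made-up table first. The sketch below is only an illustration: toy_games and its values are hypothetical and are not taken from courage.csv; only the column names mirror the ones used above.

```r
library(tidyverse)

# Hypothetical toy data: the teams and scores below are made up
toy_games <- tibble(
  home_team = c("NC", "POR", "NC"),
  home_pts  = c(2, 1, 0),
  away_pts  = c(1, 3, 0)
)

toy_games <- toy_games %>%
  mutate(
    # if_else(condition, value_if_true, value_if_false), applied row by row:
    # take home_pts when NC is the home team, otherwise take away_pts
    courage_points = if_else(home_team == "NC", home_pts, away_pts),
    courage_home   = if_else(home_team == "NC", "Home", "Away")
  )

toy_games
```

In this toy table, the second row has a different home team, so its courage_points value comes from away_pts and its courage_home value is "Away".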

🧶 ✅ ⬆️ Now is another good time to knit, commit, and push.

  5. Do the Courage have a home-field advantage? To explore this question:

    • Create side-by-side box plots of the number of points scored by the Courage based on whether or not they were the home team.
    • Then create a ridge plot using geom_density_ridges(). See the lecture notes and the ggridges vignette for more information and example code.
    • What do the ridge plots reveal that the box plots do not? What do the box plots reveal that the ridge plots do not?
  6. Each of the Courage’s seasons had 26 games, including playoff games. Does the total number of points scored in a game change over the course of a season? For example, does the total number of points decrease, perhaps due to fatigue, or does it increase over a season as teams get into a groove? To explore this question:

    • Create a new variable for the total number of points scored by both teams in a game, and call it total_points.
    • Then use geom_jitter() to create a scatterplot of the total points versus game number. The function geom_jitter() adds some random noise to the points so they don’t overlap each other.
    • Based on this plot, does there appear to be a general change in total number of points scored over the course of the season? Briefly explain your response.
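The effect of geom_jitter() is easiest to see on data where several observations share the same coordinates. The following is a minimal sketch on made-up data (the tibble toy and the width and height values are assumptions chosen for illustration, not settings required by the assignment).

```r
library(tidyverse)

# Hypothetical toy data: each (x, 1) pair appears twice, so plain points
# at y = 1 would sit exactly on top of each other
toy <- tibble(
  x = rep(1:5, times = 4),
  y = rep(c(0, 1, 1, 2), each = 5)
)

# geom_point() draws duplicated observations at the same location
p_point <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# geom_jitter() moves each point by a small random amount;
# width and height set the maximum shift in data units
p_jitter <- ggplot(toy, aes(x = x, y = y)) +
  geom_jitter(width = 0.2, height = 0.1) +
  labs(title = "Toy data with jittered points", x = "x", y = "y")

p_jitter
```

Because the noise is random, the jittered plot looks slightly different each time it is drawn. Keeping width and height small helps the points stay close to their true values.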

🧶 ✅ ⬆️ Now is another good time to knit, commit, and push.

  7. Let’s explore if the observations from the previous exercise differ by season. Create a new plot that builds upon the plot from the previous exercise by coloring the points by season and using geom_smooth() to show the general trend for each season. Include the argument se = FALSE to omit the bands around the smoothed curves. Hint: Use as.factor(season) in the ggplot code, so season is treated as a categorical variable.

    • Does there seem to be a difference in the pattern by season? Briefly explain your response.
  8. Now, let’s focus just on points scored by the Courage. Make a scatter plot to visualize the relationship between game_number and courage_points, faceted by season.

    • What are 2 observations you have from the plot?
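The hint about as.factor() matters because a numeric variable mapped to color is treated as continuous, so ggplot2 does not split the data into groups. A minimal sketch on made-up data (toy, year, week, and value are hypothetical names and values, not the assignment data):

```r
library(tidyverse)

# Hypothetical toy data: year is stored as a number, like a season column
toy <- tibble(
  year  = rep(c(2017, 2018), each = 10),
  week  = rep(1:10, times = 2),
  value = c(1, 3, 2, 4, 6, 5, 7, 9, 8, 10,
            10, 8, 9, 7, 5, 6, 4, 2, 3, 1)
)

# as.factor(year) makes year categorical, so each year gets its own
# discrete color and its own smoothed curve
p <- ggplot(toy, aes(x = week, y = value, color = as.factor(year))) +
  geom_point() +
  geom_smooth(se = FALSE) +  # se = FALSE omits the band around each curve
  labs(title = "Toy trend by year", x = "Week", y = "Value", color = "Year")

p
```

Setting the legend title with labs(color = ...) avoids showing the raw expression as.factor(year) in the legend.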

🧶 ✅ ⬆️ Now is another good time to knit, commit, and push.

Submission

Once you are finished with the assignment, knit, commit, and push one final time, then submit the PDF document produced by that final knit to Gradescope.

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes. Remember – you must turn in a .pdf file to the Gradescope page by the submission deadline to be considered “on time”.

To submit your assignment:

Grading (50 pts)


Component               Points
Ex 1                    2
Ex 2                    6
Ex 3                    6
Ex 4                    4
Ex 5                    6
Ex 6                    8
Ex 7                    6
Ex 8                    6
Workflow & formatting   6

Grading notes: