In this homework assignment, you will…
Go to the sta199-fa21-003 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo and starting a new R project.
We will use the tidyverse and ggridges packages for this assignment. If you wish to use the viridis color palettes, you will need the viridis package as well.
library(tidyverse)
library(ggridges)
library(viridis) #optional
Today, we will be working with data from the first three full seasons of the NC Courage, a highly successful National Women’s Soccer League (NWSL) Team located near Duke in Cary, NC. The Courage moved to the Triangle from Western New York in 2017 and had three very successful first seasons, culminating in winning the NWSL championship game that was held at their home stadium in Cary in 2019! (Data for this lab was sourced from the nwslR package, and verified with the NC Courage website by Meredith Brown in a previous semester.)
courage <- read_csv("data/courage.csv")
The variables in the dataset are as follows:
- `game_id`: an ID for the game that identifies the teams and the date
- `game_date`: the date of the game
- `game_number`: the order of the game in the season (i.e., 1st, 2nd, etc.)
- `home_team`: the name of the home team
- `away_team`: the name of the away team
- `opponent`: the name of the Courage’s opponent
- `home_pts`: the number of points scored by the home team
- `away_pts`: the number of points scored by the away team
- `result`: the outcome of the game from the Courage’s perspective
- `season`: the season the game took place in (i.e., 2017, 2018, 2019)
As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
In addition, no line of code should exceed 80 characters, so that all the code can be read when you knit to PDF. See the Lab 02 instructions for how to add a margin line at column 80.
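As a rough illustration of the 80-character guideline, the sketch below splits one plotting call across several short lines. It uses the built-in `mpg` data that ships with ggplot2 rather than the assignment data, and the title and labels are placeholders.

```r
library(tidyverse)

# Break after each `+` and, if needed, after each argument,
# so that no single line runs past column 80.
p_wrapped <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  labs(
    title = "Number of cars by class",
    x = "Class",
    y = "Count"
  )

p_wrapped
```

Breaking after `+` (and after `%>%` in pipelines) keeps each step on its own line, which also makes commits easier to read.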
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. Periodic reminders in this assignment will prompt you to knit, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
How many rows are in the `courage` dataset? How many columns? Include the code and resulting output used to support your answer.
Create a bar chart to visualize the distribution of the `result` of the games. Include a clear title and axis labels. What outcome occurred most frequently?
🧶 ✅ ⬆️ Now is a good time to knit, commit, and push.
Now, let’s examine how the Courage performed in each individual season. Create a stacked bar plot showing the distribution of `result` within each `season`. You’re encouraged (but not required) to use the viridis color palette. Include a clear title and axis labels. What are 2-3 observations you have from the plot?
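For reference, here is a minimal sketch of a stacked bar plot with a discrete viridis palette. The data frame `toy` and its columns `group` and `outcome` are made up for illustration and are not the Courage data. The sketch uses `scale_fill_viridis_d()`, which ships with ggplot2; the viridis package offers its own scale functions as an alternative.

```r
library(tidyverse)

# Made-up categories for illustration only
toy <- tibble(
  group   = c("A", "A", "A", "B", "B", "B"),
  outcome = c("win", "win", "loss", "win", "tie", "loss")
)

p_stacked <- ggplot(toy, aes(x = group, fill = outcome)) +
  geom_bar() +              # bars are stacked by default
  scale_fill_viridis_d() +  # discrete viridis palette from ggplot2
  labs(
    title = "Outcomes by group (toy data)",
    x = "Group",
    y = "Count",
    fill = "Outcome"
  )

p_stacked
```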
Now let’s consider the distribution of points scored by the Courage in a game for all seasons. Make a histogram of the total number of points scored by Courage in a game. Use the histogram to describe the distribution of points scored by Courage.
To get started, use the code below to create two new columns:
- `courage_points`: the number of points scored by Courage in a game
- `courage_home`: whether or not Courage was the home team (you will use this variable later on).
courage <- courage %>%
  mutate(courage_points = if_else(home_team == "NC", home_pts, away_pts),
         courage_home = if_else(home_team == "NC", "Home", "Away"))
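To see what `if_else()` does inside `mutate()`, the sketch below applies the same pattern to three made-up games. The tibble `toy`, the team code "POR", the scores, and the column names `nc_points` and `nc_home` are all hypothetical and chosen only for illustration.

```r
library(tidyverse)

# Made-up games for illustration only (not the real Courage data)
toy <- tibble(
  home_team = c("NC", "POR", "NC"),
  home_pts  = c(2, 1, 0),
  away_pts  = c(0, 3, 0)
)

toy <- toy %>%
  mutate(
    # take home_pts where the condition is TRUE, away_pts where it is FALSE
    nc_points = if_else(home_team == "NC", home_pts, away_pts),
    nc_home   = if_else(home_team == "NC", "Home", "Away")
  )

toy$nc_points  # 2 3 0
toy$nc_home    # "Home" "Away" "Home"
```

In the second row "NC" is not the home team, so the value comes from `away_pts`.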
🧶 ✅ ⬆️ Now is another good time to knit, commit, and push.
Does Courage have a home field advantage? To explore this question, make a plot using `geom_density_ridges()`.
Each of Courage’s seasons had 26 games, including playoff games. Does the total number of points scored in a game change over the course of a season? For example, does it decrease, perhaps due to fatigue, or does it increase over a season as teams get into a groove? To explore this question:
- Create a new variable called `total_points`, the total number of points scored in a game.
- Use `geom_jitter()` to create a scatterplot of the total points versus game number. The function `geom_jitter()` adds some random noise to the points so they don’t overlap each other.
🧶 ✅ ⬆️ Now is another good time to knit, commit, and push.
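The effect of `geom_jitter()` is easiest to see on data where many points share the same coordinates. The sketch below uses a made-up tibble `toy` with integer-valued `x` and `y`; the `width` and `height` values are arbitrary choices that control how far points may be moved.

```r
library(tidyverse)

# Made-up integer-valued data: several rows share the same (x, y) pair,
# so geom_point() would draw them on top of each other
toy <- tibble(
  x = rep(1:5, times = 4),
  y = rep(c(0, 1, 1, 2), each = 5)
)

p_jitter <- ggplot(toy, aes(x = x, y = y)) +
  # each point is shifted by up to 0.2 horizontally and 0.1 vertically
  geom_jitter(width = 0.2, height = 0.1) +
  labs(title = "Jittered scatterplot (toy data)", x = "x", y = "y")

p_jitter
```

Because the noise is random, the plot looks slightly different each time it is drawn.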
Let’s explore if the observations from the previous exercise differ by season. Create a new plot that builds upon the plot from the previous exercise by coloring the points by `season` and using `geom_smooth()` to show the general trend for each season. Include the argument `se = FALSE` to omit the bands around the smoothed curves. Hint: Use `as.factor(season)` in the ggplot code, so `season` is treated as a categorical variable.
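The sketch below shows, on made-up data, why wrapping a numeric column in `as.factor()` matters: the color aesthetic then defines discrete groups, and `geom_smooth()` draws one curve per group. The tibble `toy` and its columns `year`, `step`, and `value` are hypothetical.

```r
library(tidyverse)

# Made-up data with a numeric grouping column, similar in spirit to a year
toy <- tibble(
  year  = rep(c(2001, 2002), each = 10),
  step  = rep(1:10, times = 2),
  value = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
            6, 5, 5, 4, 4, 3, 3, 2, 2, 1)
)

p_trend <- ggplot(toy, aes(x = step, y = value, color = as.factor(year))) +
  geom_jitter(width = 0.1, height = 0.1) +
  geom_smooth(se = FALSE) +  # one smoothed curve per color, no band
  labs(
    title = "Trend by year (toy data)",
    x = "Step",
    y = "Value",
    color = "Year"
  )

p_trend
```

Without `as.factor()`, `year` would be mapped to a continuous color gradient and a single curve would be fit to all points.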
Now, let’s focus just on points scored by Courage. Make a scatter plot to visualize the relationship between `game_number` and `courage_points`, faceted by `season`.
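As a reminder of how faceting works, here is a minimal sketch using `facet_wrap()` on made-up data. The tibble `toy` and its columns are hypothetical; `facet_grid()` would be another reasonable choice.

```r
library(tidyverse)

# Made-up data for illustration only
toy <- tibble(
  year  = rep(c(2001, 2002), each = 5),
  step  = rep(1:5, times = 2),
  value = c(1, 3, 2, 4, 3, 2, 2, 1, 3, 0)
)

p_facet <- ggplot(toy, aes(x = step, y = value)) +
  geom_point() +
  facet_wrap(~ year) +  # one panel per distinct value of year
  labs(title = "Value by step, one panel per year (toy data)",
       x = "Step", y = "Value")

p_facet
```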
🧶 ✅ ⬆️ Now is another good time to knit, commit, and push.
Once you are finished with the assignment, do a final knit, commit, and push, then submit the resulting PDF document to Gradescope.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes. Remember – you must turn in a .pdf file to the Gradescope page by the submission deadline to be considered “on time”.
To submit your assignment:
Component | Points |
---|---|
Ex 1 | 2 |
Ex 2 | 6 |
Ex 3 | 6 |
Ex 4 | 4 |
Ex 5 | 6 |
Ex 6 | 8 |
Ex 7 | 6 |
Ex 8 | 6 |
Workflow & formatting | 6 |
Grading notes: