Learning Goals

In this lab you will…

Continue practicing a collaborative data science workflow
Calculate marginal, joint, and conditional probabilities in a reproducible way
Visualize categorical data
Use visualizations and probabilities to describe the association between two categorical variables.

Getting started

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta199-fa21-003 course organization on GitHub.
You should see a repo with the *lab-05** prefix.
Each person on the team should clone the repository and open a new project in RStudio. Do not make any changes to the .Rmd file until the instructions tell you do to so.

Workflow: Using git and GitHub as a team

Assign each person on your team a number 1 through 4. For teams of three, Team Member 1 can take on the role of Team Member 4.
The following exercises must be done in order. Only one person should type in the .Rmd file and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.

Update YAML

Team Member 1: Change the author to your team name and include each team member’s name in the author field of the YAML in the following format. Team Name: Member 1, Member 2, Member 3, Member 4. Knit, commit, and push the changes to GitHub.

Team Members 2, 3, 4: Click the Pull** button in the Git pane to get the updated document. You should see the updated name in the .Rmd file.**

Packages

We will use the tidyverse and knitr packages in this lab. You can also use the viridis for the visualizations.

library(tidyverse)
library(viridis)
library(knitr)

Data

The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. The survey includes demographic information, questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here. You will be analyzing data from the 2018 GSS in the lab.

The goal of the lab is to create visualizations and calculate associated probabilities to analyze respondents’ views about industrial air pollution and government spending on alternative energy sources. The variables of interest for this analysis are…

indus: Response to the prompt “In general, do you think that air pollution caused by industry is”.
- Don’t know, Not dangerous, Somewhat dangerous, Very dangerous, Extremely dangerous
altenergy: Response to the prompt, "We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, …, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount. Are we spending too much, too little, or about the right amount on developing alternative energy sources?
- Don’t know, Too little, About right, Too much

Use the code below to load the data set.

gss <- read_csv("data/gss2018.csv")

Exercises

For each exercise, show all relevant code and output used to obtain your response. Write all narrative in complete sentences, and use clear axis labels and titles on visualizations.

🧶 ✅ ⬆️ Team Member 1: If you haven’t already, change the author to your team name and include each team member’s name in the author field of the YAML in the following format. Team Name: Team member 1, Team member 2, Team member 3, Team member 4.

Type the team’s response to Exercises 1 - 3.

How many observations are in this dataset? What does each observation represent?
By default, R will arrange the categories of a categorical variable in alphabetical order in any output and visualizations, but we want the levels for indus and altenergy to be in logical order. To achieve this, we will use the factor() function to make both of these variables factors (categorical variables with ordering) and specify the levels we wish to use.

The code to reorder levels for indus is below. Use this code to make indus a factor, and write code to make altenergy a factor with the levels in the following order: “Don’t know”, “Too little”, “About right”, “Too much.” Save your result to the gss data frame, so the reordered variables are used throughout the lab.

gss <- gss %>%
  mutate(indus = factor(indus, levels = c("Not dangerous", "Somewhat dangerous", 
                                          "Very dangerous", 
                                          "Extremely dangerous")))

Then, filter the data only observations with values for indus are included in the data set. Your updated data frame should have 783 observations. You will use this for the remainder of the lab.

Before looking at the relationship between thoughts on impact of industrial air pollution to environment and government spending on alternative energy sources, we’ll look at the distribution of each variable individually.
- Make a bar plot to examine the distribution of indus. Then calculate the marginal probabilities for this variable.
- In general, how do survey respondents feel about the impact of industrial air pollution? Write 1 - 2 observations from the visualization and probabilities to support your response.

🧶 ✅ ⬆️ Team member 1: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 1-3.

Team Member 2: It’s your turn! Type the team’s response to exercises 4 - 5.

Make a bar plot to examine the distribution of altenergy and calculate the marginal probabilities for this variable. In general, how do survey respondents feel about the amount of money the US government is spending on alternative energy sources? Write 1 - 2 observations from the visualization and probabilities to support your response.
What would you expect the relationship between thoughts about the impact of industrial air pollution on the environment and the amount of government spending on alternative energy sources to be?

🧶 ✅ ⬆️ Team member 2: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 4 - 5.

Team Member 3: It’s your turn! Type the team’s response to exercise 6.

Visualize the distribution of thoughts about government spending on alternative energy sources based on levels of thoughts about the impact of industrial air pollution on the environment. Then make a contingency table that can be used to analyze the two variables. Write 2 - 3 observations from the graph and contingency table. (Hint: Use the coord_flip() function to change the axes and make the labels on the plot more readable.)

🧶 ✅ ⬆️ Team member 3: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercise 6.

Team Member 4: It’s your turn! Type the team’s response to exercises 7 - 8.

A randomly chosen respondent thinks government is spending too little on alternative energy sources. What is the probability this individual thinks industrial air pollution is Extremely dangerous? What is the probability this individual thinks it is Very dangerous, Somewhat dangerous or Not dangerous?
Are thoughts on the impact of industrial air pollution to environment and government spending on alternative energy sources independent? Briefly explain, supporting your response based on the analysis from Exercises 6 and 7.

🧶 ✅ ⬆️ Team member 4: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the team’s completed lab!

Wrapping up

Go back through your write up to make sure you followed the coding style guidelines we discussed in class (e.g. no long lines of code).

Submission

Select one team member to upload the team’s PDF submission to Gradescope.
Be sure to select every team member’s name in the Gradescope.
Associate the “Workflow & formatting” graded section with the first page of your PDF, and mark the page associated with each exercise. If any answer spans multiple pages, then mark all pages.

There should only be one submission per team on Gradescope.

Grading

Component	Points
Ex 1	2
Ex 2	5
Ex 3	7
Ex 4	7
Ex 5	3
Ex 6	10
Ex 7	6
Ex 8	4
Workflow & formatting	6

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least one informative commit by each team member, labeling the code chunks, and having readable code in the tidyverse syntax that does not exceed 80 characters (i.e., we can read all your code in the knitted PDF.)

Lab 05: Probability & the General Social Survey

due Thursday, October 7 at 11:59pm