In this lab you will…
tidymodels
framework to build a linear model and estimate regression parametersA repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta199-fa21-003 course organization on GitHub.
You should see a repo with the *lab-08** prefix.
Each person on the team should clone the repository and open a new project in RStudio. Do not make any changes to the .Rmd file until the instructions tell you do to so.
We will use the tidyverse
and tidymodels
packages in this assignment.
Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.
In this lab we will see how much evolutionary history predicts parasite similarity.
Today’s dataset comes from an Ecology Letters paper by Cooper at al. (2012) entitled “Phylogenetic host specificity and understanding parasite sharing in primates” located here. The goal of the paper was to identify the ability of evolutionary history and ecological traits to characterize parasite host specificity.
Each row of the data contains two species, species1
and species2
.
Subsequent columns describe metrics that compare the species:
divergence_time
: how many (millions) of years ago the two species diverged. i.e. how many million years ago they were the same species.distance
: geodesic distance between species geographic range centroids (in kilometers)BMdiff
: difference in body mass between the two species (in grams)precdiff
: difference in mean annual precipitation across the two species geographic ranges (mm)parsim
: a measure of parasite similarity (proportion of parasites shared between species, ranges from 0 to 1.)The data are available in parasites.csv
in the data
folder`.
To get started, load the data and save the data frame as parasites
.
divergence_time
and parsim
.
geom_smooth(method = "lm")
.This is called a “logit” transformation and takes values between \((0, 1]\) and maps them to \((-\infty, + \infty)\) like we desire while preserving their order.
Since parsim
takes values between 0 and 1, we want to transform this variable so that it can range between \((-\infty, + \infty)\). This will be better suited for fitting a regression model (and interpreting predicted values!)
transformed_parsim
that is calculated as log(parsim/(1-parsim))
. Add this variable to your data frame.divergence_time
and transformed_parsim
. Add a regression line to your visualization.Which variable is the strongest individual predictor of parasite similarity between species? To answer this question, begin by fitting a linear regression model to each pair of variables. Do not report the model outputs in a tidy
format but save each one as dt_model
, dist_model
, BM_model
and prec_model
, respectively.
divergence_time
and transformed_parsim
distance
and transformed_parsim
BMdiff
and transformed_parsim
precdiff
and transformed_parsim
Would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?
To compare the explanatory power of each individual predictor, we will look at \(R^2\) between the models. \(R^2\) is a measure of how much of the variability in the response variable is explained by the model (we will talk more about \(R^2\) and the mathematics behind it in an upcoming lecture!).
As you may have guessed from the name \(R^2\) can be calculated by squaring the correlation (recall the correlation from Lab 01). The correlation \(r\) takes values -1 to 1, therefore, \(R^2\) takes values 0 to 1. Intuitively, if \(r = 1 \text{ or }-1\), then \(R^2 = 1\), indicating the model is a perfect fit for the data. If \(r \approx 0\) then \(R^2 \approx 0\), indicating the model is a very bad fit for the data.
You can calculate \(R^2\) using the glance
function. For example, you can calculate \(R^2\) for dt_model
using the code glance(dt_model)$r.squared
.
Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Please only upload your PDF document to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages.
Component | Points |
---|---|
Make parasite data frame | 2 |
Ex 1 | 8 |
Ex 2 | 12 |
Ex 3 | 8 |
Ex 4 | 8 |
Ex 5 | 7 |
Workflow & formatting | 5 |