Lab 09: Intro to linear regression

due Monday, November 08 at 11:59pm

Learning Goals

In this lab you will…

Getting started

Packages

We will use the tidyverse and tidymodels packages in this assignment.

Intro

Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.

In this lab we will see how well evolutionary history predicts parasite similarity.

The Data

Today’s dataset comes from an Ecology Letters paper by Cooper et al. (2012) entitled “Phylogenetic host specificity and understanding parasite sharing in primates” located here. The goal of the paper was to identify the ability of evolutionary history and ecological traits to characterize parasite host specificity.

Each row of the data contains two species, species1 and species2.

Subsequent columns describe metrics that compare the species:

The data are available in parasites.csv in the data folder.

Exercises

Instructions

  • Make sure we see all relevant code and output in the knitted PDF. If you use inline code, make sure we can still see the code used to derive that answer.
  • Write a narrative for each exercise.
  • All narrative should be written in full sentences, and visualizations should have clear titles and axis labels.

To get started, load the data and save the data frame as parasites.

  1. Let’s start by examining the relationship between divergence_time and parsim.
    • Based on the goals of the analysis, what is the response variable?
    • Visualize the relationship between the two variables.
    • Use the visualization to describe the relationship between the two variables.
  2. Fit the linear regression model and display the results.
    • Write the regression equation.
    • Interpret the slope and the intercept in the context of the data.
    • Recreate the visualization from Exercise 1, this time adding a regression line using geom_smooth(method = "lm").
    • What do you notice about the prediction (regression) line that may be strange, particularly for very large divergence times?
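
As a minimal sketch of the workflow in Exercises 1 and 2, the code below fits a simple linear regression with tidymodels and overlays the fitted line on a scatterplot. It uses a small made-up data frame (toy, with columns x and y), not the parasites data; the names toy, toy_fit and toy_plot and all values are assumptions for illustration only.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical toy data (NOT the parasites data)
toy <- tibble(
  x = c(5, 10, 20, 40, 60, 80),
  y = c(0.60, 0.45, 0.30, 0.15, 0.08, 0.02)
)

# Fit a simple linear regression of y on x
toy_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ x, data = toy)

# Display the estimated intercept and slope in a tidy format
tidy(toy_fit)

# Scatterplot with the least-squares line added
toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Toy response vs. toy predictor",
    x = "Toy predictor (x)",
    y = "Toy response (y)"
  )
toy_plot
```

The same pattern should carry over to the lab data once the response and explanatory variables have been chosen.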

  3. Since parsim takes values between 0 and 1, we want to transform this variable so that it can take any value in \((-\infty, +\infty)\). The transformed variable will be better suited for fitting a regression model (and interpreting predicted values!).

    • Create a new variable transformed_parsim that is calculated as log(parsim/(1-parsim)). Add this variable to your data frame.
    • Then, visualize the relationship between divergence_time and transformed_parsim. Add a regression line to your visualization.
    • Write a 1-2 sentence description of what you observe in the visualization.

    The calculation log(parsim/(1-parsim)) is called a “logit” transformation. It maps values in \((0, 1)\) to \((-\infty, +\infty)\), as desired, while preserving their order.
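
To make the transformation log(p/(1-p)) concrete, here is a small base R sketch applied to a handful of made-up proportions. The helper name logit and the vector p are assumptions for illustration; they are not part of the lab data or of any package used here.

```r
# Logit transformation: maps a proportion p in (0, 1) to the real line
logit <- function(p) log(p / (1 - p))

# Hypothetical proportions, sorted from smallest to largest
p <- c(0.01, 0.25, 0.50, 0.75, 0.99)
z <- logit(p)
z

# The transformation is increasing, so the ordering of the values is preserved
all(diff(z) > 0)

# The endpoints 0 and 1 are not mapped to finite values
logit(c(0, 1))
```

A proportion of 0.5 maps to 0, values below 0.5 map to negative numbers, and values above 0.5 map to positive numbers.
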
  4. Which variable is the strongest individual predictor of parasite similarity between species? To answer this question, begin by fitting a linear regression model to each pair of variables listed below. Do not report the model outputs in a tidy format, but save the models as dt_model, dist_model, BM_model and prec_model, respectively.

    • divergence_time and transformed_parsim
    • distance and transformed_parsim
    • BMdiff and transformed_parsim
    • precdiff and transformed_parsim

    Would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?

  5. To compare the explanatory power of each individual predictor, we will compare \(R^2\) across the models. \(R^2\) is a measure of how much of the variability in the response variable is explained by the model (we will talk more about \(R^2\) and the mathematics behind it in an upcoming lecture!).

    As you may have guessed from the name, for a linear regression model with one predictor, \(R^2\) can be calculated by squaring the correlation (recall the correlation from Lab 01). The correlation \(r\) takes values from -1 to 1; therefore, \(R^2\) takes values from 0 to 1. Intuitively, if \(r = 1 \text{ or } -1\), then \(R^2 = 1\), indicating the model is a perfect fit for the data. If \(r \approx 0\), then \(R^2 \approx 0\), indicating the model explains almost none of the variability in the response.

    You can calculate \(R^2\) using the glance function. For example, you can calculate \(R^2\) for dt_model using the code glance(dt_model)$r.squared.

    • Calculate \(R^2\) for each model fit in the previous exercise.
    • Which variable is the strongest predictor of parasite similarity? Briefly explain your choice.
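
The link between \(R^2\) and the correlation can be checked numerically. The sketch below, on a made-up data frame (toy, toy_model and the values are assumptions, not the lab data), compares glance(...)$r.squared with the squared correlation for a model with a single predictor.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical toy data, for illustration only
toy <- tibble(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9)
)

# Simple linear regression of y on x
toy_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ x, data = toy)

# R-squared reported by glance()
r_squared <- glance(toy_model)$r.squared

# Correlation between the predictor and the response
r <- cor(toy$x, toy$y)

# For a single-predictor model the two quantities agree
all.equal(r_squared, r^2)
```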

Submission

Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Please only upload your PDF document to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark the pages where the answer to each exercise appears. If any answer spans multiple pages, then mark all pages.

Grading (50 points)

Component                   Points
Make parasite data frame    2
Ex 1                        8
Ex 2                        12
Ex 3                        8
Ex 4                        8
Ex 5                        7
Workflow & formatting       5