# Quantifying uncertainty
## Introduction to Data Science @ Duke
### <a href="https://www.introds.org/">introds.org</a>

---

layout: true
 
<div class="my-footer">

<a href="https://introds.org" target="_blank">introds.org</a>

</div>

---

# Inference

---

## Terminology

**Population**: a group of individuals or objects we are interested in studying

**Parameter**: a numerical quantity derived from the population
(almost always unknown)

If we had data from every unit in the population, we could just calculate 
population parameters and be done!

Unfortunately, we usually cannot do this, so we draw conclusions from a sample instead.

**Sample**: a subset of our population of interest

**Statistic**: a numerical quantity derived from a sample
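
To make these four terms concrete, here is a minimal sketch in R using a simulated population. The population values, its size, and the names `population` and `samp` are made up for illustration; in practice the full population is not available to us.

```r
set.seed(1)

# Hypothetical population: prices for 10,000 listings (simulated, not real data)
population <- rgamma(10000, shape = 4, scale = 20)

# Parameter: a number computed from the whole population (usually unknown)
parameter <- mean(population)

# Sample: a random subset of the population
samp <- sample(population, size = 50)

# Statistic: the same quantity computed from the sample
statistic <- mean(samp)

c(parameter = parameter, statistic = statistic)
```

The statistic will usually be close to, but not exactly equal to, the parameter.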

---

## Inference

If the sample is **representative**, then we can use the tools of probability and statistical inference to draw .vocab[generalizable] conclusions about the broader population of interest.

This is similar to tasting a spoonful of soup while cooking to make an inference about the entire pot.

---

## Statistical inference

**Statistical inference** is the process of using sample data to draw 
conclusions about the underlying population the sample came from.

- **Estimation**: using the sample to estimate a plausible range of values for the unknown parameter

- **Testing**: evaluating whether our observed sample provides evidence 
for or against some claim about the population

Today we will focus on **estimation**.

---

# Estimation

---

## Let's \*virtually\* go to Asheville!

**How much should we expect to pay for an Airbnb in Asheville?**

---

## Asheville data

[Inside Airbnb](http://insideairbnb.com/) scraped all Airbnb listings in 
Asheville, NC, that were active on June 25, 2020.

**Population of interest**: listings in Asheville with at least ten reviews.

**Parameter of interest**: Mean price per guest per night among these 
listings.

.question[
In June 2020, what was the mean price per guest per night among Airbnb listings 
in Asheville (zip codes 28801 - 28806) with at least ten reviews?
]

We have data on the price per guest (`ppg`) for a random
sample of 50 Airbnb listings.

---

## Point estimate

A **point estimate** is a single value computed from the sample data to serve
as the "best guess", or estimate, for the population parameter.

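```r
library(tidyverse)

# Random sample of 50 Asheville listings; `ppg` is the price per guest
abb <- read_csv("data/asheville.csv")

# Point estimate: the sample mean of price per guest
abb %>% 
  summarize(mean_price = mean(ppg))
```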

```
## # A tibble: 1 × 1
##   mean_price
##        <dbl>
## 1       76.6
```

---

## Visualizing our sample

---

.pull-left[
<img src="img/14/spear.png" width="400" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/14/net.png" width="400" style="display: block; margin: auto;" />
]

---

.question[
If you want to estimate a population parameter, do you prefer to report a range 
of values the parameter might be in, or a single value?
]

.pull-left[
<img src="img/14/spear.png" width="400" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/14/net.png" width="400" style="display: block; margin: auto;" />
]

---

.question[
If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?
]

- If we report a point estimate, we probably won’t hit the exact population 
parameter.

- If we report a range of plausible values, we have a good shot at capturing 
the parameter.

---

.footnote[
Source: [Biden vs Trump: who is leading the 2020 US election polls?](https://ig.ft.com/us-election-2020/), 10 Sep 2020.
]

---

## Confidence intervals

- A plausible range of values for the population parameter is called a **confidence interval**.

- To construct a confidence interval, we need to quantify the variability of our sample statistic.

- For example, to construct a confidence interval for a population mean, we need a plausible range of values around our observed sample mean.

- This range depends on how precise and how accurate our sample mean is as an estimate of the population mean.

- Quantifying this requires a measure of how much we would expect the sample mean to vary from sample to sample.

---

.question[
Suppose we split the class in half and ask each student their height. Then, we calculate the mean height of students 
on each side of the classroom. Would you expect these two means to be exactly 
equal, close but not equal, or wildly different?
]

.question[
Suppose you randomly sample 50 students and 5 of them are left-handed. If you 
were to take another random sample of 50 students, how many would you expect to 
be left-handed? Would you be surprised if only 3 of them were left-handed? Would 
you be surprised if 40 of them were left-handed?
]
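
The second question can be explored by simulation. The sketch below assumes, purely for illustration, that 10% of all students are left-handed (matching the 5 out of 50 in the first sample) and draws many random samples of 50. The name `p_left` and the number of simulated samples are arbitrary choices.

```r
set.seed(2)

# Assumed population proportion of left-handed students (illustrative only)
p_left <- 0.10

# Number of left-handed students in each of 1,000 random samples of 50
counts <- rbinom(1000, size = 50, prob = p_left)

# How often does each count occur?
table(counts)

# Share of samples with 3 or fewer, and with 40 or more, left-handed students
mean(counts <= 3)
mean(counts >= 40)
```

Under this assumption, counts near 5 are common, a count of 3 is unremarkable, and a count of 40 essentially never occurs.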

---

## Quantifying the variability

We can quantify the variability of sample statistics using different approaches:

- **Simulation**: via bootstrapping or "resampling" techniques (**today's focus**)

- **Theory**: via the Central Limit Theorem (**coming soon!**)

---

# Bootstrapping

---

## The bootstrap principle

- The term **bootstrapping** comes from the phrase "pulling oneself up by one’s 
bootstraps", which is a metaphor for accomplishing an impossible task without 
any outside help.

- **Impossible task**: estimating a population parameter using data from only the given sample.

- **Note**: This notion of saying something about a population parameter using 
only information from an observed sample is the crux of statistical inference; it is not limited to bootstrapping.

---

## The bootstrap procedure

1. Take a **bootstrap sample** - a random sample taken *with replacement* from the original sample, *of the same size* as the original sample.

2. Calculate the bootstrap statistic: the statistic you’re interested in (the 
mean, the median, the correlation, etc.) computed on the bootstrap sample.

3. Repeat steps (1) and (2) many times to create a .vocab[bootstrap distribution] - a distribution of bootstrap statistics.

4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.
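
The four steps above can be sketched in base R. The vector `ppg_sample` is simulated stand-in data rather than the Asheville sample, and the number of resamples (`n_boot`) is an arbitrary choice.

```r
set.seed(3)

# Stand-in for an observed sample of 50 prices (simulated, not the Asheville data)
ppg_sample <- rgamma(50, shape = 4, scale = 20)

n_boot <- 5000  # number of bootstrap samples (arbitrary choice)

# Steps 1-3: resample with replacement, same size as the original sample,
# and record the mean of each bootstrap sample
boot_means <- replicate(
  n_boot,
  mean(sample(ppg_sample, size = length(ppg_sample), replace = TRUE))
)

# Step 4: the middle 95% of the bootstrap distribution
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))
ci_95
```

Swapping `mean()` for `median()` or another statistic gives the corresponding bootstrap distribution without changing the rest of the code.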

---

## The original sample

---

## Step-by-step

**Step 1.** Take a **bootstrap sample**: a random sample taken 
*with replacement* from the original sample, *of the same size* as the 
original sample:
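
As a small illustration of this step, the sketch below resamples a made-up vector of eight prices (`original` holds hypothetical values, not the Asheville data). Because sampling is done with replacement, some values typically appear more than once and others not at all.

```r
set.seed(4)

# Made-up original sample of 8 prices (illustrative values only)
original <- c(45, 52, 60, 68, 75, 81, 96, 120)

# One bootstrap sample: same size, drawn with replacement
boot_sample <- sample(original, size = length(original), replace = TRUE)

sort(boot_sample)

# How many times was each original value drawn?
table(factor(boot_sample, levels = original))
```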

---

## Step-by-step

**Step 2.** Calculate the bootstrap statistic (in this case, the sample mean) 
using the bootstrap sample:

---

## Step-by-step

**Step 3.** Do steps 1 and 2 over and over again to create a bootstrap 
distribution of sample means:

<img src="14-bootstrap_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />
 
---

## Step-by-step

**Step 3.** In this plot, we've taken 500 bootstrap samples, calculated the
sample mean for each, and plotted them in a histogram:

---

**Here we compare the bootstrap distribution of sample means to that 
of the original data. What do you notice?**
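
One way to explore this question numerically is to compare the center and spread of the two distributions. The sketch below uses simulated stand-in data (`ppg_sample`), not the Asheville sample, and an arbitrary number of resamples.

```r
set.seed(5)

# Stand-in for the original sample of 50 prices (simulated)
ppg_sample <- rgamma(50, shape = 4, scale = 20)

# Bootstrap distribution of the sample mean
boot_means <- replicate(2000, mean(sample(ppg_sample, replace = TRUE)))

# Both distributions are centered near the sample mean ...
c(sample_mean = mean(ppg_sample), boot_center = mean(boot_means))

# ... but the bootstrap means vary far less than individual prices do
c(sd_data = sd(ppg_sample), sd_boot_means = sd(boot_means))

# The spread shrinks by roughly a factor of sqrt(n)
sd(ppg_sample) / sqrt(length(ppg_sample))
```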

---

## Step-by-step

**Step 4.** Calculate the bounds of the bootstrap interval using percentiles of the bootstrap distribution.

---

## Interpreting a confidence interval

Using the 2.5th and 97.5th quantiles as bounds for our confidence interval gives 
us the middle 95% of the bootstrap means. Our 95% CI is 
(65.08, 89.42). What does this interval tell us?
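
One way to think about what "95% confident" refers to is to repeat the whole procedure on many samples from a population whose mean is known. The sketch below does this with a simulated population (all values and names, including the helper `boot_ci()`, are hypothetical) and counts how often the bootstrap percentile interval contains the true mean.

```r
set.seed(6)

# Hypothetical gamma population with a known mean (shape * scale = 80)
true_mean <- 4 * 20

# Bootstrap percentile interval for the mean of a sample x
boot_ci <- function(x, n_boot = 1000, level = 0.95) {
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
}

# Draw 200 independent samples of size 50 and build an interval from each
covers <- replicate(200, {
  x <- rgamma(50, shape = 4, scale = 20)
  ci <- boot_ci(x)
  ci[[1]] <= true_mean && true_mean <= ci[[2]]
})

# Proportion of intervals that contain the true mean
mean(covers)
```

The proportion is typically in the neighborhood of 0.95, though it can fall somewhat below for small, skewed samples. The confidence level describes how the procedure behaves over many samples rather than any single interval.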

---

## Interpretation

.question[
The 95% confidence interval is 65.08 to 89.42. What is the correct interpretation for this interval? 
]

**A** There is a 95% probability the mean price per night for an Airbnb in Asheville is between 65.08 and 89.42.

**B** There is a 95% probability the price per night for an Airbnb in Asheville is between 65.08 and 89.42.

**C** We are 95% confident the mean price per night for Airbnbs in Asheville is between 65.08 and 89.42.

**D** We are 95% confident the price per night for an Airbnb in Asheville is between 65.08 and 89.42.