Inference using the Central Limit Theorem

Introduction to Data Science @ Duke

introds.org

The Central Limit Theorem

For a population with a well-defined mean $μ$ and standard deviation $σ$, these three properties hold for the distribution of the sample mean $\bar{X}$, assuming certain conditions hold (independent observations and a sufficiently large sample size):

✅ The distribution of the sample statistic is nearly normal

✅ The distribution is centered at the (often unknown) population parameter $μ$

✅ The variability of the distribution is inversely proportional to the square root of the sample size: its standard deviation is $σ / \sqrt{n}$
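
The three properties can be checked with a quick simulation. This is a minimal sketch, not part of the original slides: the exponential population, the sample size `n`, the number of repetitions `reps`, and the seed are all assumptions chosen for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
mu    <- 2                       # an Exponential(rate = 1/2) population has mean 2 ...
sigma <- 2                       # ... and standard deviation 2 (a skewed population)
n     <- 50                      # hypothetical sample size
reps  <- 10000                   # number of simulated samples

# draw `reps` samples of size n and record each sample mean
x_bars <- replicate(reps, mean(rexp(n, rate = 1 / mu)))

center   <- mean(x_bars)         # should be close to mu
spread   <- sd(x_bars)           # should be close to sigma / sqrt(n)
# share of sample means within 1.96 standard errors of mu;
# about 0.95 if the sampling distribution is nearly normal
coverage <- mean(abs(x_bars - mu) < 1.96 * sigma / sqrt(n))

c(center = center, spread = spread, theory_se = sigma / sqrt(n), coverage = coverage)
```

Rerunning with a larger `n` should shrink `spread` in proportion to $1 / \sqrt{n}$.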

Why do we care?

Knowing the distribution of the sample statistic $\bar{X}$ can help us

estimate a population parameter as point estimate $\pm$ margin of error
- the margin of error combines how confident we want to be with how variable the sample statistic is

test a claim about a population parameter by evaluating how likely it is to obtain a sample statistic at least as extreme as the one observed, assuming that the null hypothesis is true
- this probability depends on how variable the sampling distribution is

Inference based on the CLT

If necessary conditions are met, we can also use inference methods based on the CLT. Suppose we know the true population standard deviation, $σ$ .

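Then the CLT tells us that $\bar{X}$ approximately has the distribution $N (μ, σ / \sqrt{n})$, where the second parameter is the standard deviation of $\bar{X}$ (the standard error).

That is, approximately,

$Z = \frac{\bar{X} - μ}{σ / \sqrt{n}} \sim N (0, 1)$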
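
As a minimal sketch of how this statistic could be used, the code below standardizes a sample mean and finds an upper-tail probability under $N(0, 1)$. All four input values are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Hypothetical inputs, chosen only for illustration
x_bar <- 52      # observed sample mean
mu_0  <- 50      # hypothesized population mean
sigma <- 10      # population standard deviation, assumed known
n     <- 100     # sample size

se <- sigma / sqrt(n)                     # standard error of the sample mean
z  <- (x_bar - mu_0) / se                 # standardized sample mean
p_upper <- pnorm(z, lower.tail = FALSE)   # P(Z > z) under N(0, 1)

c(se = se, z = z, p_upper = p_upper)
```

Here the standard error is 1, so the sample mean sits exactly 2 standard errors above the hypothesized mean.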

What if $σ$ isn't known?

T distribution

In practice, we rarely know the true value of $σ$, so we estimate it from our data with the sample standard deviation $s$.

We can then form the following test statistic for a single population mean, which has a t-distribution with $n - 1$ degrees of freedom:

$T = \frac{\bar{X} - μ}{s / \sqrt{n}} \sim t_{n - 1}$
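
The statistic can be computed by hand and compared with base R's `t.test()`. This is a sketch under stated assumptions: the data vector `x` and the hypothesized mean `mu_0` are made up for illustration.

```r
# Hypothetical sample of n = 10 measurements, for illustration only
x    <- c(4.1, 5.3, 4.8, 5.9, 5.1, 4.4, 5.6, 5.0, 4.7, 5.5)
mu_0 <- 4.5                       # hypothesized population mean

n      <- length(x)
s      <- sd(x)                   # sample standard deviation, our estimate of sigma
t_stat <- (mean(x) - mu_0) / (s / sqrt(n))

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# t.test() should reproduce both numbers
check <- t.test(x, mu = mu_0)
c(by_hand = t_stat, t_test = unname(check$statistic))
c(by_hand = p_val,  t_test = check$p.value)
```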

T distribution

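Like the standard normal distribution, the t-distribution is unimodal, symmetric, and centered at 0
It has thicker tails than the normal distribution
- The thicker tails account for the additional variability introduced by using $s$ instead of $σ$ when calculating the SE
Its shape is determined by a single parameter, the degrees of freedom; as the degrees of freedom increase, it approaches the standard normal distribution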
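
One way to see the thicker tails is to compare a lower-tail probability across several degrees of freedom. This is a small sketch; the cutoff of -1.96 and the degrees of freedom in `dfs` are arbitrary choices for illustration.

```r
dfs <- c(2, 9, 30, 100)              # a few illustrative degrees of freedom

# P(T < -1.96) under each t distribution, next to P(Z < -1.96)
t_tail <- pt(-1.96, df = dfs)
z_tail <- pnorm(-1.96)

data.frame(df = dfs, t_tail = t_tail, z_tail = z_tail)
```

Each t tail probability exceeds the normal one, and the gap narrows as the degrees of freedom grow.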

T vs Z distributions

T distribution

Finding probabilities under the t curve:

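# P(T < -1.96) for a t distribution with 9 degrees of freedom
pt(-1.96, df = 9)

## [1] 0.0408222

# P(T > -1.96): the upper tail, the complement of the probability above
pt(-1.96, df = 9, lower.tail = FALSE)

## [1] 0.9591778

Finding cutoff values under the t curve:

# Q1: the 25th percentile
qt(0.25, df = 9)

## [1] -0.7027221

# Q3: the 75th percentile
qt(0.75, df = 9)

## [1] 0.7027221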
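
The same `qt()` call gives the critical value for a confidence interval of the form point estimate $\pm$ margin of error. The sketch below assumes hypothetical summary statistics (`x_bar`, `s`, `n`) and a 95% confidence level; none of these come from the slides.

```r
# Hypothetical summary statistics, for illustration only
x_bar <- 5.0
s     <- 0.6
n     <- 10
conf_level <- 0.95

# critical value t*: cuts off the upper 2.5% of the t curve with n - 1 df
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)

margin <- t_star * s / sqrt(n)       # margin of error = t* times the standard error
ci <- c(lower = x_bar - margin, upper = x_bar + margin)

t_star
ci
```

With 9 degrees of freedom, `t_star` is larger than the familiar normal cutoff `qnorm(0.975)`, so the interval is wider than a z-based one would be.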

Resident satisfaction in Durham

durham_survey contains resident responses to a survey conducted by the City of Durham in 2018. The respondents are a randomly selected, representative sample of Durham residents.

Each question was rated on a scale from 1 to 5, with 1 being "highly dissatisfied" and 5 being "highly satisfied."

Is there evidence that, on average, Durham residents are generally satisfied (mean score greater than 3) with the quality of the public library system?

Exploratory Data Analysis

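library(tidyverse)

# keep only rated responses: the scale runs from 1 to 5, so 9 is not a satisfaction score
durham <- read_csv("data/durham_survey.csv") %>%
  filter(quality_library != 9)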

durham %>% 
  summarise(x_bar = mean(quality_library), 
            med = median(quality_library), 
            sd = sd(quality_library), 
            n = n())

## # A tibble: 1 × 4
##   x_bar   med    sd     n
##   <dbl> <dbl> <dbl> <int>
## 1  3.97     4 0.900   521

Hypotheses

What are the hypotheses for evaluating if Durham residents, on average, are generally satisfied with the public library system?

$H_{0} : μ = 3$

$H_{a} : μ > 3$

where $μ$ is the mean library quality rating among all Durham residents.
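
As a rough sketch of where these hypotheses lead, the test statistic can be computed from the summary statistics printed in the exploratory analysis. Those values are rounded, so the result is approximate, and the variable names are chosen here for illustration.

```r
# Summary statistics as printed (rounded) in the exploratory analysis
x_bar <- 3.97
s     <- 0.900
n     <- 521
mu_0  <- 3                           # value of mu under the null hypothesis

se     <- s / sqrt(n)                # estimated standard error of the sample mean
t_stat <- (x_bar - mu_0) / se

# upper-tail probability, because the alternative is mu > 3
p_val <- pt(t_stat, df = n - 1, lower.tail = FALSE)

c(se = se, t_stat = t_stat, p_val = p_val)
```

With the raw data loaded, `t.test(durham$quality_library, mu = 3, alternative = "greater")` should give essentially the same test without rounding.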
