Inference using the Central Limit Theorem



Introduction to Data Science @ Duke

introds.org

1 / 25

The Central Limit Theorem

For a population with a well-defined mean μ and standard deviation σ, the following three properties hold for the sampling distribution of the sample mean X̄, provided certain conditions are met:

✅ The distribution of the sample statistic is nearly normal

✅ The distribution is centered at the (often unknown) population parameter

✅ The variability of the distribution is inversely proportional to the square root of the sample size
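
The three properties can be checked with a small simulation. This is only a sketch: the exponential population, the sample size, and the number of simulations are assumptions chosen for illustration, not part of the course data.

```r
set.seed(1)

# Assumed population: exponential with mean 2 (its sd is also 2), clearly skewed
mu <- 2
sigma <- 2
n <- 50          # size of each sample
n_sims <- 10000  # number of simulated samples

# One sample mean per simulated sample
sample_means <- replicate(n_sims, mean(rexp(n, rate = 1 / mu)))

# Center of the sampling distribution vs. the population mean
c(simulated = mean(sample_means), theory = mu)

# Spread of the sampling distribution vs. sigma / sqrt(n)
c(simulated = sd(sample_means), theory = sigma / sqrt(n))
```

Plotting `hist(sample_means)` should show a roughly bell-shaped histogram even though the population itself is skewed.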

2 / 25

Why do we care?

Knowing the distribution of the sample statistic X̄ can help us

  • estimate a population parameter as point estimate ± margin of error
    • the margin of error combines a measure of how confident we want to be with a measure of how variable the sample statistic is

  • test a claim about a population parameter by evaluating how likely it is to obtain the observed sample statistic, assuming that the null hypothesis is true
    • this probability depends on how variable the sampling distribution is
3 / 25

Inference based on the CLT

4 / 25

Inference based on the CLT

If necessary conditions are met, we can also use inference methods based on the CLT. Suppose we know the true population standard deviation, σ.

Then the CLT tells us that X̄ approximately has the distribution N(μ, σ/√n).

That is,

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
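
As a rough illustration of the Z statistic when σ is known, the sketch below uses made-up numbers (the sample mean, hypothesized mean, σ, and n are all hypothetical) and computes a one-sided p-value from the standard normal distribution.

```r
# Hypothetical values, chosen only for illustration
x_bar <- 103   # observed sample mean
mu_0  <- 100   # hypothesized population mean
sigma <- 15    # population standard deviation, assumed known
n     <- 100   # sample size

# Standardize: how many standard errors is x_bar from mu_0?
z <- (x_bar - mu_0) / (sigma / sqrt(n))
z

# One-sided p-value, P(Z > z), under the standard normal
p_value <- pnorm(z, lower.tail = FALSE)
p_value
```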

5 / 25

What if σ isn't known?

6 / 25

T distribution

In practice, we almost never know the true value of σ, so we estimate it from our data with the sample standard deviation s.

We can then form the following test statistic for a single population mean, which follows a t-distribution with n − 1 degrees of freedom:

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

7 / 25

T distribution

  • The t-distribution is also unimodal and symmetric, and is centered at 0

  • It has thicker tails than the normal distribution

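    • The thicker tails account for the additional variability introduced by using s instead of σ when calculating the SE
  • Its shape is determined by the degrees of freedom; as the degrees of freedom increase, it gets closer to the standard normal distribution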

8 / 25

T vs Z distributions
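
One way to see the difference numerically is to compare tail probabilities. The sketch below is an illustration only; the cutoff of -1.96 and the particular degrees of freedom are arbitrary choices.

```r
# Degrees of freedom to compare (arbitrary choices)
dfs <- c(2, 9, 30, 520)

# Lower-tail probability below -1.96 under each t distribution
tail_t <- pt(-1.96, df = dfs)

# Same tail probability under the standard normal
tail_z <- pnorm(-1.96)

data.frame(df = dfs,
           t_tail = round(tail_t, 4),
           z_tail = round(tail_z, 4))
```

The t tail is always larger than the normal tail, and the gap shrinks as the degrees of freedom grow.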

9 / 25

T distribution

Finding probabilities under the t curve:

# P(T < -1.96)
pt(-1.96, df = 9)
## [1] 0.0408222
# P(T > -1.96)
pt(-1.96, df = 9, lower.tail = FALSE)
## [1] 0.9591778

Finding cutoff values under the t curve:

# Q1, the 25th percentile
qt(0.25, df = 9)
## [1] -0.7027221
# Q3, the 75th percentile
qt(0.75, df = 9)
## [1] 0.7027221
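
A short sketch of how `pt()` and `qt()` relate: they are inverses of each other, and the symmetry of the t curve ties the two quartiles together. The variable name `deg_free` is introduced here only for illustration.

```r
deg_free <- 9

# qt() maps a probability to a cutoff; pt() maps the cutoff back
q1 <- qt(0.25, df = deg_free)
pt(q1, df = deg_free)

# The lower and upper tails at the same cutoff always sum to 1
pt(-1.96, df = deg_free) + pt(-1.96, df = deg_free, lower.tail = FALSE)

# Symmetry about 0: Q3 is the mirror image of Q1
q3 <- qt(0.75, df = deg_free)
q1 + q3
```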
10 / 25

Resident satisfaction in Durham

durham_survey contains resident responses to a survey conducted by the City of Durham in 2018. The respondents are a randomly selected, representative sample of Durham residents.

Each question was rated on a scale of 1 to 5, with 1 being "highly dissatisfied" and 5 being "highly satisfied."

Is there evidence that, on average, Durham residents are generally satisfied (score greater than 3) with the quality of the public library system?

11 / 25

Exploratory Data Analysis

library(tidyverse)

# Keep only substantive ratings; 9 is not on the 1-5 scale
# (presumably a "don't know" / no-answer code)
durham <- read_csv("data/durham_survey.csv") %>%
  filter(quality_library != 9)

durham %>%
  summarise(x_bar = mean(quality_library),
            med = median(quality_library),
            sd = sd(quality_library),
            n = n())
## # A tibble: 1 × 4
##   x_bar   med    sd     n
##   <dbl> <dbl> <dbl> <int>
## 1  3.97     4 0.900   521

12 / 25

Hypotheses

What are the hypotheses for evaluating if Durham residents, on average, are generally satisfied with the public library system?

$$H_0: \mu = 3 \qquad H_a: \mu > 3$$

13 / 25

Conditions

What conditions must be satisfied to conduct this hypothesis test using methods based on the CLT? Are these conditions satisfied?

Independence?

✅ The residents were randomly selected for the survey, and 521 is less than 10% of the Durham population (~ 270,000).

Sample size / distribution?

✅ n = 521 > 30, so the sample is large enough to apply the Central Limit Theorem, even though ratings on a 1 to 5 scale are not themselves normally distributed.

14 / 25

Calculating the test statistic

Summary statistics from the sample:

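durham_summary <- durham %>%
  summarise(xbar = mean(quality_library), s = sd(quality_library), n = n())
durham_summary
## # A tibble: 1 × 3
##    xbar     s     n
##   <dbl> <dbl> <int>
## 1  3.97 0.900   521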

And the CLT says:

$$\bar{x} \sim N\left(\text{mean} = \mu,\ SE = \frac{\sigma}{\sqrt{n}}\right)$$

15 / 25

Calculating the test statistic

How many standard errors away from the hypothesized population mean is the observed sample mean? This is our test statistic.

(se <- durham_summary$s / sqrt(durham_summary$n)) # SE
## [1] 0.03944416
(t <- (durham_summary$xbar - 3) / se) # Test statistic
## [1] 24.57372
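
The same calculation can be written out by hand using the rounded summary statistics. The symbol $\mu_0$ is notation introduced here for the hypothesized mean of 3; because the inputs are rounded, the result agrees with the R output only to about three significant figures.

```latex
$$
SE = \frac{s}{\sqrt{n}} = \frac{0.900}{\sqrt{521}} \approx 0.0394,
\qquad
T = \frac{\bar{x} - \mu_0}{SE} = \frac{3.97 - 3}{0.0394} \approx 24.6
$$
```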
16 / 25

Calculating the p-value

How likely are we to observe a sample mean that is at least as extreme as the observed sample mean, if in fact the null hypothesis is true?

(df <- durham_summary$n - 1) # Degrees of freedom
## [1] 520
pt(t, df, lower.tail = FALSE) # P-value, P(T > t |H_0 true)
## [1] 2.247911e-89
17 / 25

Conclusion

The p-value is extremely small (about 2.2 × 10⁻⁸⁹, far below α = 0.05), so we reject H0.

The data provide sufficient evidence at the α = 0.05 level that Durham residents, on average, are satisfied with the quality of the public library system (μ > 3).

Would you expect a 95% confidence interval to include 3?

18 / 25

Confidence interval for a mean

General form of the confidence interval

$$\text{point estimate} \pm \text{critical value} \times SE$$

Confidence interval for the mean

$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$

19 / 25

Calculate 95% confidence interval

$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$

# Critical value
t_star <- qt(0.975, df)
# Point estimate
point_est <- durham_summary$xbar
# Confidence interval
CI <- point_est + c(-1,1) * t_star * se
round(CI, 2)
## [1] 3.89 4.05
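
The steps above can be wrapped into a small helper that works from summary statistics alone. This is a sketch: `t_ci()` is a hypothetical function written for illustration (it is not part of any package), and it is fed the rounded values from the summary table.

```r
# Hypothetical helper (not from the slides or any package):
# t-based confidence interval for a mean from summary statistics
t_ci <- function(xbar, s, n, conf_level = 0.95) {
  se <- s / sqrt(n)                                   # standard error
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # critical value
  xbar + c(-1, 1) * t_star * se
}

# 95% interval from the rounded survey summaries
ci_95 <- t_ci(xbar = 3.97, s = 0.900, n = 521)
round(ci_95, 2)

# A higher confidence level gives a wider interval
ci_99 <- t_ci(xbar = 3.97, s = 0.900, n = 521, conf_level = 0.99)
round(ci_99, 2)
```

With these inputs the 95% interval rounds to the same 3.89 to 4.05 obtained from the full data.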
20 / 25

Interpret 95% confidence interval

The 95% confidence interval is 3.89 to 4.05.

Interpret this interval in context of the data.

We are 95% confident that the true mean rating for Durham residents' satisfaction with the library system is between 3.89 and 4.05.
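
What "95% confident" means can be illustrated by simulation: if we repeatedly drew samples and built an interval each time, about 95% of those intervals would contain the true mean. The sketch below assumes a normal population with made-up parameter values; it does not use the survey data.

```r
set.seed(42)

# Hypothetical population values, for illustration only
mu <- 4
sigma <- 0.9
n <- 50
n_sims <- 5000

# For each simulated sample, build a 95% t interval and check whether it contains mu
covers <- replicate(n_sims, {
  x <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
  ci[1] <= mu && mu <= ci[2]
})

# Proportion of intervals that captured the true mean
coverage <- mean(covers)
coverage
```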

21 / 25

CLT-based inference using infer

22 / 25

CLT-based hypothesis testing in infer

$$H_0: \mu = 3 \quad \text{vs} \quad H_a: \mu > 3$$

library(infer)

durham %>%
  t_test(response = quality_library,
         mu = 3,
         alternative = "greater",
         conf_int = FALSE)
## # A tibble: 1 × 5
##   statistic  t_df  p_value alternative estimate
##       <dbl> <dbl>    <dbl> <chr>          <dbl>
## 1      24.6   520 2.25e-89 greater         3.97
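
As a cross-check, the same one-sample test can be run with base R's `t.test()` and compared against the manual calculation. The ratings below are simulated and hypothetical (they are not the Durham survey data), so the numbers will differ from the output above; the point is only that the two approaches agree.

```r
set.seed(2018)

# Hypothetical ratings on a 1-5 scale; NOT the Durham survey data
ratings <- sample(1:5, size = 521, replace = TRUE,
                  prob = c(0.02, 0.05, 0.15, 0.48, 0.30))

# Manual calculation: standard error, test statistic, one-sided p-value
se <- sd(ratings) / sqrt(length(ratings))
t_manual <- (mean(ratings) - 3) / se
p_manual <- pt(t_manual, df = length(ratings) - 1, lower.tail = FALSE)

# Base R equivalent of the one-sided, one-sample t test
tt <- t.test(ratings, mu = 3, alternative = "greater")

c(manual = t_manual, t.test = unname(tt$statistic))
c(manual = p_manual, t.test = tt$p.value)
```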
23 / 25

CLT-based confidence intervals in infer

Calculate a 95% confidence interval for the mean satisfaction rating.

# mu is left at its default of 0, so statistic and p_value refer to
# H0: mu = 0 and are not of interest here; only the interval is used
durham %>%
  t_test(response = quality_library,
         alternative = "two-sided",
         conf_int = TRUE, conf_level = 0.95)
## # A tibble: 1 × 7
##   statistic  t_df p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1      101.   520       0 two.sided       3.97     3.89     4.05
24 / 25

What is similar, and what is different, between the CLT-based test of means and the simulation-based test?

25 / 25
