For a population with a well-defined mean μ and standard deviation σ, these three properties hold for the sampling distribution of the sample mean $\bar{X}$, assuming certain conditions hold:
✅ The distribution of the sample statistic is nearly normal
✅ The distribution is centered at the (often unknown) population parameter
✅ The variability of the distribution is inversely proportional to the square root of the sample size
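The three properties above can be checked with a small simulation. This is only a sketch under assumed settings: the exponential population, the seed, the sample size of 50, and the 10,000 replicates are all illustrative choices, not part of the original slides.

```r
# Simulate the sampling distribution of the sample mean from a skewed population.
# All settings below are illustrative assumptions.
set.seed(199)      # arbitrary seed, for reproducibility
mu    <- 2         # mean of an Exponential(rate = 1/2) population
sigma <- 2         # its standard deviation (equal to the mean for an exponential)
n     <- 50        # size of each sample
reps  <- 10000     # number of simulated samples

# Each replicate draws n observations and records their mean
sample_means <- replicate(reps, mean(rexp(n, rate = 1 / mu)))

mean(sample_means)   # centered near mu
sd(sample_means)     # close to sigma / sqrt(n)
sigma / sqrt(n)      # theoretical standard error

# Rough normality check: about 95% of means fall within 1.96 SE of mu
mean(abs(sample_means - mu) < 1.96 * sigma / sqrt(n))
```

Even though the population is strongly right-skewed, the simulated means should pile up symmetrically around μ with spread close to σ/√n.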
Knowing the distribution of the sample statistic $\bar{X}$ can help us test hypotheses about the population mean and construct confidence intervals for it.
If necessary conditions are met, we can also use inference methods based on the CLT. Suppose we know the true population standard deviation, σ.
Then the CLT tells us that $\bar{X}$ approximately has the distribution $N(\mu, \sigma/\sqrt{n})$.
That is,
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
In practice, we almost never know the true value of σ, so we estimate it from our data with the sample standard deviation s.
Replacing σ with s gives the following test statistic for a single population mean, which has a t-distribution with n − 1 degrees of freedom:
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$
The t-distribution is also unimodal and symmetric, and is centered at 0
It has thicker tails than the normal distribution
It is defined by the degrees of freedom
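To see the thicker tails and the role of the degrees of freedom, one can compare the same lower-tail probability under several t-distributions and under the standard normal. This is a minimal sketch; the cutoff of -1.96 and the particular degrees of freedom are illustrative choices.

```r
# Lower-tail probability below -1.96 for several t-distributions
dfs    <- c(2, 9, 30, 100, 1000)
t_tail <- pt(-1.96, df = dfs)   # P(T < -1.96) for each df
z_tail <- pnorm(-1.96)          # P(Z < -1.96) under N(0, 1), about 0.025

# The t tail areas are all larger than the normal tail area,
# and they shrink toward it as the degrees of freedom grow
data.frame(df = dfs, t_tail = t_tail, z_tail = z_tail)
```

With few degrees of freedom the t tail area is noticeably larger than the normal one; by 1000 degrees of freedom the two are nearly indistinguishable.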
Finding probabilities under the t curve:
# P(t < -1.96)
pt(-1.96, df = 9)
## [1] 0.0408222
# P(t > -1.96)
pt(-1.96, df = 9, lower.tail = FALSE)
## [1] 0.9591778
Finding cutoff values under the t curve:
# Find Q1
qt(0.25, df = 9)
## [1] -0.7027221
# Find Q3
qt(0.75, df = 9)
## [1] 0.7027221
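`pt()` and `qt()` are inverses of one another: `pt()` maps a cutoff to a probability, and `qt()` maps a probability back to a cutoff. The short check below, using the same 9 degrees of freedom as the examples above, also illustrates the symmetry of the t-distribution about 0.

```r
df_example <- 9

# pt() then qt() returns the original cutoff
p <- pt(-1.96, df = df_example)   # lower-tail probability
cutoff <- qt(p, df = df_example)  # recovers -1.96
cutoff

# Lower and upper tail areas at the same cutoff add to 1
tail_sum <- pt(-1.96, df = df_example) +
  pt(-1.96, df = df_example, lower.tail = FALSE)
tail_sum

# Symmetry about 0: the first and third quartiles are mirror images
quartile_sum <- qt(0.25, df = df_example) + qt(0.75, df = df_example)
quartile_sum
```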
durham_survey contains resident responses to a survey given by the City of Durham in 2018. These are a randomly selected, representative sample of Durham residents.
Questions were rated 1 - 5, with 1 being "highly dissatisfied" and 5 being "highly satisfied."
Is there evidence that, on average, Durham residents are generally satisfied (score greater than 3) with the quality of the public library system?
# read_csv(), filter(), summarise(), and %>% come from the tidyverse
library(tidyverse)
durham <- read_csv("data/durham_survey.csv") %>%
  filter(quality_library != 9)
durham %>%
  summarise(x_bar = mean(quality_library),
            med = median(quality_library),
            sd = sd(quality_library),
            n = n())
## # A tibble: 1 × 4
##   x_bar   med    sd     n
##   <dbl> <dbl> <dbl> <int>
## 1  3.97     4 0.900   521
What are the hypotheses for evaluating if Durham residents, on average, are generally satisfied with the public library system?
$$H_0: \mu = 3 \qquad H_a: \mu > 3$$
What conditions must be satisfied to conduct this hypothesis test using methods based on the CLT? Are these conditions satisfied?
Independence?
✅ The residents were randomly selected for the survey, and 521 is less than 10% of the Durham population (~ 270,000).
Sample size / distribution?
✅ 521 > 30, so the sample is large enough to apply the Central Limit Theorem.
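The two numeric checks above are simple enough to verify directly. In this sketch, `n_sample` and `N_pop` are hypothetical names, and the population size is the approximate figure of 270,000 quoted above rather than an exact count.

```r
n_sample <- 521      # number of responses analyzed
N_pop    <- 270000   # approximate Durham population, as quoted above

# Independence: the sample should be less than 10% of the population
sampling_fraction <- n_sample / N_pop
sampling_fraction            # roughly 0.002, far below 0.10
sampling_fraction < 0.10

# Sample size: common rule of thumb for applying the CLT to a mean
n_sample > 30
```

Random selection itself cannot be checked from the numbers; it depends on how the survey was conducted.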
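Summary statistics from the sample:
durham_summary <- durham %>%
  summarise(xbar = mean(quality_library), s = sd(quality_library), n = n())
durham_summary
## # A tibble: 1 × 3
##    xbar     s     n
##   <dbl> <dbl> <int>
## 1  3.97 0.900   521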
And the CLT says:
$$\bar{x} \sim N\left(\text{mean} = \mu,\ SE = \frac{\sigma}{\sqrt{n}}\right)$$
How many standard errors away from the hypothesized population mean is the observed sample mean? This is our test statistic.
(se <- durham_summary$s / sqrt(durham_summary$n)) # SE
## [1] 0.03944416
(t <- (durham_summary$xbar - 3) / se) # Test statistic
## [1] 24.57372
How likely are we to observe a sample mean that is at least as extreme as the observed sample mean, if in fact the null hypothesis is true?
(df <- durham_summary$n - 1) # Degrees of freedom
## [1] 520
pt(t, df, lower.tail = FALSE) # P-value, P(T > t |H_0 true)
## [1] 2.247911e-89
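The standard error, test statistic, degrees of freedom, and p-value can be bundled into one small helper that works from summary statistics alone. This is a sketch rather than part of the original analysis: `one_sample_t()` is a hypothetical helper, and it is called here with the rounded summary values (3.97, 0.900, 521), so its statistic differs slightly from the 24.57 obtained from the unrounded data.

```r
# Hypothetical helper: one-sided (greater) one-sample t-test from summary statistics
one_sample_t <- function(xbar, s, n, mu0) {
  se        <- s / sqrt(n)                 # standard error of the mean
  statistic <- (xbar - mu0) / se           # standard errors between xbar and mu0
  df        <- n - 1                       # degrees of freedom
  p_value   <- pt(statistic, df = df, lower.tail = FALSE)  # P(T > t | H0 true)
  list(se = se, statistic = statistic, df = df, p_value = p_value)
}

# Rounded summary statistics from the survey sample
res <- one_sample_t(xbar = 3.97, s = 0.900, n = 521, mu0 = 3)
res$statistic   # about 24.6
res$df          # 520
res$p_value     # vanishingly small
```

Naming the result `statistic` instead of `t` also avoids masking R's built-in `t()` function.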
The p-value is far smaller than α = 0.05, so we reject $H_0$.
The data provide sufficient evidence at the α = 0.05 level that Durham residents, on average, are satisfied with the quality of the public library system (μ > 3).
Would you expect a 95% confidence interval to include 3?
General form of the confidence interval
$$\text{point estimate} \pm \text{critical value} \times SE$$
Confidence interval for the mean
$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$
# Critical value
t_star <- qt(0.975, df)
# Point estimate
point_est <- durham_summary$xbar
# Confidence interval
CI <- point_est + c(-1, 1) * t_star * se
round(CI, 2)
## [1] 3.89 4.05
The 95% confidence interval is 3.89 to 4.05.
Interpret this interval in context of the data.
We are 95% confident that the true mean rating for Durham residents' satisfaction with the library system is between 3.89 and 4.05.
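The interval can also be reproduced from the summary statistics alone. The sketch below uses a hypothetical helper, `ci_mean()`, with the rounded values 3.97, 0.900, and 521; it is an illustration of the formula, not the original analysis code.

```r
# Hypothetical helper: t-based confidence interval for a mean from summary statistics
ci_mean <- function(xbar, s, n, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  t_star <- qt(1 - alpha / 2, df = n - 1)   # critical value t*_{n-1}
  se     <- s / sqrt(n)                     # standard error of the mean
  xbar + c(-1, 1) * t_star * se             # lower and upper bounds
}

ci_95 <- ci_mean(xbar = 3.97, s = 0.900, n = 521)
round(ci_95, 2)

# The whole interval lies above the hypothesized value of 3
ci_95[1] > 3
```

Raising `conf_level` widens the interval because the critical value grows, while a larger n narrows it through the standard error.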
Conducting the test with the infer package
$$H_0: \mu = 3 \quad \text{vs.} \quad H_a: \mu > 3$$
durham %>%
  t_test(response = quality_library, mu = 3, alternative = "greater", conf_int = FALSE)
## # A tibble: 1 × 5
##   statistic  t_df  p_value alternative estimate
##       <dbl> <dbl>    <dbl> <chr>          <dbl>
## 1      24.6   520 2.25e-89 greater         3.97
Confidence interval with the infer package
Calculate a 95% confidence interval for the mean satisfaction rating.
durham %>%
  t_test(response = quality_library, alternative = "two-sided", conf_int = TRUE, conf_level = 0.95)
## # A tibble: 1 × 7
##   statistic  t_df p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1      101.   520       0 two.sided       3.97     3.89     4.05
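Base R's `t.test()` carries out the same one-sample test, which gives a way to confirm that the manual calculation and the packaged test agree. Because the survey file is not available here, the sketch below uses simulated 1-to-5 ratings; the seed, sample size, and rating probabilities are assumptions made purely for illustration.

```r
# Simulated ratings standing in for the survey data (illustrative assumption)
set.seed(2018)
ratings <- sample(1:5, size = 200, replace = TRUE,
                  prob = c(0.02, 0.05, 0.18, 0.45, 0.30))

# Packaged one-sample t-test of H0: mu = 3 vs Ha: mu > 3
res_ttest <- t.test(ratings, mu = 3, alternative = "greater")

# The same quantities computed by hand
manual_t <- (mean(ratings) - 3) / (sd(ratings) / sqrt(length(ratings)))
manual_p <- pt(manual_t, df = length(ratings) - 1, lower.tail = FALSE)

c(t_test = unname(res_ttest$statistic), manual = manual_t)
c(t_test = res_ttest$p.value, manual = manual_p)
```

The statistic and p-value from the two approaches should match, since `t.test()` uses the same formula and the same t-distribution with n − 1 degrees of freedom.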
What is similar, and what is different, between the CLT-based test of means vs. the simulation-based test?