class: center, middle, inverse, title-slide

# Inference using the Central Limit Theorem
## Introduction to Data Science @ Duke
### introds.org
---
layout: true

<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`, assuming certain conditions hold:

✅ The distribution of the sample statistic is nearly normal

✅ The distribution is centered at the (often unknown) population parameter

✅ The variability of the distribution is inversely proportional to the square root of the sample size

---

## Why do we care?

Knowing the distribution of the sample statistic `\(\bar{X}\)` can help us

--

- estimate a population parameter as **point estimate** `\(\boldsymbol{\pm}\)` **margin of error**
  - the .vocab[margin of error] is composed of a measure of how confident we want to be and how variable the sample statistic is

<br>

--

- test for a population parameter by evaluating how likely it is to obtain the observed sample statistic when assuming that the null hypothesis is true
  - this probability will depend on how variable the sampling distribution is

---

class: center, middle

## Inference based on the CLT

---

## Inference based on the CLT

If necessary conditions are met, we can also use inference methods based on the CLT. Suppose we know the true population standard deviation, `\(\sigma\)`.

--

Then the CLT tells us that `\(\bar{X}\)` approximately has the distribution `\(N\left(\mu, \sigma/\sqrt{n}\right)\)`. That is,

`$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$`

---

class: middle

# What if `\(\sigma\)` isn't known?

---

## T distribution

In practice, we never know the true value of `\(\sigma\)`, and so we estimate it from our data with `\(s\)`.
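
To see why swapping `\(s\)` in for `\(\sigma\)` matters, here is a minimal simulation sketch. The population (normal with mean 0 and standard deviation 1), the sample size of 5, and all object names are assumptions chosen for illustration; none of them come from the Durham data. It standardizes each simulated sample mean twice, once with the known `\(\sigma\)` and once with the sample estimate `\(s\)`, and checks how often the result falls beyond the normal cutoffs of ±1.96.

```r
set.seed(1)

n <- 5       # small sample size, where the effect is easiest to see
mu <- 0      # hypothetical population mean
sigma <- 1   # hypothetical population standard deviation

# For each simulated sample, standardize the sample mean twice:
# once with the known sigma, once with the sample estimate s
sims <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  c(with_sigma = (mean(x) - mu) / (sigma / sqrt(n)),
    with_s     = (mean(x) - mu) / (sd(x) / sqrt(n)))
})

# Proportion of standardized means beyond the normal cutoffs of +/- 1.96
tail_props <- rowMeans(abs(sims) > 1.96)
tail_props
```

The `with_sigma` proportion should land near 5%, while the `with_s` proportion should come out noticeably larger: estimating the standard deviation adds variability that the standard normal distribution does not account for.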
We can make the following test statistic for testing a single sample's population mean, which has a .vocab[t-distribution with n-1 degrees of freedom]:

.question[
$$ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$
]

---

## T distribution

- The t-distribution is also unimodal and symmetric, and is centered at 0

--

- It has thicker tails than the normal distribution
  - This is to make up for additional variability introduced by using `\(s\)` instead of `\(\sigma\)` in calculation of the SE

--

- It is defined by the degrees of freedom

---

## T vs Z distributions

<img src="clt-inference_files/figure-html/unnamed-chunk-2-1.png" width="75%" style="display: block; margin: auto;" />

---

## T distribution

.pull-left[
.vocab[Finding probabilities under the t curve:]

```r
# P(t < -1.96)
pt(-1.96, df = 9)
```

```
## [1] 0.0408222
```

```r
# P(t > -1.96)
pt(-1.96, df = 9, lower.tail = FALSE)
```

```
## [1] 0.9591778
```
]

--

.pull-right[
.vocab[Finding cutoff values under the t curve:]

```r
# Find Q1
qt(0.25, df = 9)
```

```
## [1] -0.7027221
```

```r
# Q3
qt(0.75, df = 9)
```

```
## [1] 0.7027221
```
]

---

## Resident satisfaction in Durham

`durham_survey` contains resident responses to a survey given by the City of Durham in 2018. These are a randomly selected, representative sample of Durham residents. Questions were rated 1 - 5, with 1 being "highly dissatisfied" and 5 being "highly satisfied."

--

.question[
Is there evidence that, on average, Durham residents are generally satisfied (score greater than 3) with the quality of the public library system?
]

---

## Exploratory Data Analysis

.pull-left[
.small[
```r
library(tidyverse)  # provides read_csv(), filter(), and %>%

durham <- read_csv("data/durham_survey.csv") %>%
  filter(quality_library != 9)  # drop responses coded 9, which is outside the 1-5 scale
```
]
.midi[
```r
durham %>%
  summarise(x_bar = mean(quality_library),
            med = median(quality_library),
            sd = sd(quality_library),
            n = n())
```

```
## # A tibble: 1 × 4
##   x_bar   med    sd     n
##   <dbl> <dbl> <dbl> <int>
## 1  3.97     4 0.900   521
```
]
]

.pull-right[
<img src="clt-inference_files/figure-html/unnamed-chunk-9-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Hypotheses

.question[
What are the hypotheses for evaluating if Durham residents, on average, are generally satisfied with the public library system?
]

--

`$$H_0: \mu = 3$$`

`$$H_a: \mu > 3$$`

---

## Conditions

.question[
What conditions must be satisfied to conduct this hypothesis test using methods based on the CLT? Are these conditions satisfied?
]

--

**Independence?**

--

✅ The residents were randomly selected for the survey, and 521 is less than 10% of the Durham population (~ 270,000).

--

**Sample size / distribution?**

--

✅ 521 > 30, so the sample is large enough to apply the Central Limit Theorem.

---

## Calculating the test statistic

Summary statistics from the sample:

```
## # A tibble: 1 × 3
##    xbar     s     n
##   <dbl> <dbl> <int>
## 1  3.97 0.900   521
```

--

And the CLT says:

`$$\bar{x} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)$$`

---

## Calculating the test statistic

.question[
How many standard errors away from the hypothesized population mean is the observed sample mean? This is our test statistic.
]

--

```r
# durham_summary: the one-row tibble of xbar, s, and n shown on the previous slide
(se <- durham_summary$s / sqrt(durham_summary$n)) # SE
```

```
## [1] 0.03944416
```

--

```r
(t <- (durham_summary$xbar - 3) / se) # Test statistic
```

```
## [1] 24.57372
```

---

## Calculating the p-value

.question[
How likely are we to observe a sample mean that is at least as extreme as the observed sample mean, if in fact the null hypothesis is true?
]

```r
(df <- durham_summary$n - 1) # Degrees of freedom
```

```
## [1] 520
```

--

```r
pt(t, df, lower.tail = FALSE) # P-value, P(T > t | H_0 true)
```

```
## [1] 2.247911e-89
```

---

## Conclusion

The p-value is very small, so we reject `\(H_0\)`.

--

The data provide sufficient evidence at the `\(\alpha = 0.05\)` level that Durham residents, on average, are satisfied with the quality of the public library system `\((\mu > 3)\)`.

--

.question[
Would you expect a 95% confidence interval to include 3?
]

---

## Confidence interval for a mean

.alert[
**General form of the confidence interval**

`$$point~estimate \pm critical~value \times SE$$`
]

--

.alert[
**Confidence interval for the mean**

`$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$`
]

---

## Calculate 95% confidence interval

.alert[
`$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$`
]

--

```r
# Critical value
t_star <- qt(0.975, df)
```

--

```r
# Point estimate
point_est <- durham_summary$xbar
```

--

```r
# Confidence interval
CI <- point_est + c(-1, 1) * t_star * se
round(CI, 2)
```

```
## [1] 3.89 4.05
```

---

## Interpret 95% confidence interval

The 95% confidence interval is 3.89 to 4.05.

.question[
Interpret this interval in the context of the data.
]

--

**We are 95% confident that the true mean rating for Durham residents' satisfaction with the library system is between 3.89 and 4.05.**

---

class: middle

# CLT-based inference using `infer`

---

## CLT-based hypothesis testing in `infer`

`$$H_0: \mu = 3 \text{ vs }H_a: \mu > 3$$`

--

```r
durham %>%
  t_test(response = quality_library, mu = 3, alternative = "greater", conf_int = FALSE)
```

```
## # A tibble: 1 × 5
##   statistic  t_df  p_value alternative estimate
##       <dbl> <dbl>    <dbl> <chr>          <dbl>
## 1      24.6   520 2.25e-89 greater         3.97
```

---

## CLT-based confidence intervals in `infer`

Calculate a 95% confidence interval for the mean satisfaction rating.
--

```r
durham %>%
  t_test(response = quality_library, alternative = "two-sided", conf_int = TRUE, conf_level = 0.95)
```

```
## # A tibble: 1 × 7
##   statistic  t_df p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1      101.   520       0 two.sided       3.97     3.89     4.05
```

---

class: middle

.question[
What is similar, and what is different, between the CLT-based test of means and the simulation-based test?
]
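
---

## Comparing the two approaches: a sketch

One way to explore this question is to run both tests on the same data. The sketch below uses base R only and a made-up vector of 521 ratings as a stand-in for `durham$quality_library`; the response probabilities, the seed, and every object name are assumptions for illustration, not survey results. The CLT-based test uses `t.test()`, while the simulation-based test shifts the sample so its mean equals the null value and bootstraps sample means from it.

```r
set.seed(2018)

# Hypothetical stand-in for durham$quality_library: 521 ratings on a 1-5 scale.
# The probabilities are invented for illustration, not taken from the survey.
ratings <- sample(1:5, size = 521, replace = TRUE,
                  prob = c(0.02, 0.05, 0.15, 0.48, 0.30))

mu_0 <- 3  # hypothesized mean under H_0

# CLT-based test: one-sample, one-sided t-test
clt_test <- t.test(ratings, mu = mu_0, alternative = "greater")

# Simulation-based test: shift the sample so its mean equals mu_0,
# then bootstrap sample means from the shifted data to build a null distribution
shifted <- ratings - mean(ratings) + mu_0
null_means <- replicate(5000, mean(sample(shifted, replace = TRUE)))

# Proportion of simulated means at least as large as the observed mean
sim_p_value <- mean(null_means >= mean(ratings))

c(clt_p_value = clt_test$p.value, sim_p_value = sim_p_value)
```

Both approaches ask the same question, namely how unusual the observed mean would be if `\(H_0\)` were true. They differ in where the null distribution comes from: a theoretical t-distribution in one case, resampled data in the other. A simulated p-value can also be no finer than one over the number of resamples, so a very small theoretical p-value may show up as exactly 0 in the simulation.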