class: center, middle, inverse, title-slide

# The Central Limit Theorem
## Introduction to Data Science @ Duke
### introds.org
---
layout: true

<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

class: center, middle

## Sample Statistics and Sampling Distributions

---

## Variability of sample statistics

- We've seen that each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

- Previously, we quantified this variability via simulation

- Today we discuss some of the theory underlying **sampling distributions**, particularly as they relate to sample means

---

## Statistical inference

- Statistical inference is the act of generalizing from a sample in order to draw conclusions about a population.

- We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

- As part of this process, we must quantify the degree of uncertainty in our sample statistic.

---

## Sampling distribution of the mean

Suppose we're interested in the mean resting heart rate of students at Duke, and are able to do the following:

--

1. Take a random sample of size `\(n\)` from this population, and calculate the mean resting heart rate in this sample, `\(\bar{X}_1\)`

--

2. Put the sample back, take a second random sample of size `\(n\)`, and calculate the mean resting heart rate from this new sample, `\(\bar{X}_2\)`

--

3. Put the sample back, take a third random sample of size `\(n\)`, and calculate the mean resting heart rate from this sample, too...

--

...and so on.

---

## Sampling distribution of the mean

After repeating this many times, we have a data set that has the sample means from the population: `\(\bar{X}_1\)`, `\(\bar{X}_2\)`, `\(\cdots\)`, `\(\bar{X}_K\)` (assuming we took `\(K\)` total samples).

--

.question[
Can we say anything about the distribution of these sample means (that is, the **sampling distribution** of the mean)?
]

*(Keep in mind, we don't know what the underlying distribution of resting heart rate of Duke students looks like!)*

---

class: center, middle

## The Central Limit Theorem

---

class: middle

A quick caveat...

For now, let's assume we know the underlying standard deviation, `\(\sigma\)`, of our distribution.

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample mean `\(\bar{X}\)`, assuming certain conditions hold:

--

1. The mean of the sampling distribution of the mean is identical to the population mean `\(\mu\)`.

--

2. The standard deviation of the distribution of the sample means is `\(\sigma/\sqrt{n}\)`.
    - This is called the **standard error (SE)** of the mean.

--

3. For `\(n\)` large enough, the shape of the sampling distribution of means is approximately **normally distributed**.

---

## The normal (Gaussian) distribution

The normal distribution is unimodal and symmetric and is described by its **density function**:

If a random variable `\(X\)` follows the normal distribution, then

`$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}$$`

where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance `\((\sigma \text{ is the standard deviation})\)`.

.alert[
We often write `\(N(\mu, \sigma)\)` to describe this distribution.
]

---

## The normal distribution (graphically)

<img src="clt-intro_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" />
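
---

## The density formula in R

A quick sanity check, added here as a sketch (it is not part of the original deck, and `normal_density()` is a hypothetical helper name): we can code the density formula above by hand and confirm it matches R's built-in `dnorm()`.

```r
# hand-coded normal density, following the formula two slides back
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-0.5 * (x - mu)^2 / sigma^2)
}

normal_density(1.5)
```

```
## [1] 0.1295176
```

```r
dnorm(1.5)  # built-in equivalent, with mean = 0 and sd = 1 by default
```

```
## [1] 0.1295176
```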
## Wait, *any* distribution?

The central limit theorem tells us that **sample means** are approximately normally distributed, provided we have enough data and certain assumptions hold. This is true *even if our original variables are not normally distributed*.

Click [here](http://onlinestatbook.com/stat_sim/sampling_dist/index.html) to see an interactive demonstration of this idea.

---

## Conditions for CLT

We need to check two conditions for the CLT to hold: independence and sample size/distribution.

--

✅ **Independence:** The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:

- the sample must be taken randomly
- if sampling without replacement, the sample size must be less than 10% of the population size

--

If the observations are independent, then by definition one observation's value does not "influence" another's.

---

## Conditions for CLT

✅ **Sample size / distribution:**

- if data are numerical, usually `\(n > 30\)` is considered a large enough sample for the CLT to apply
- if we know for sure that the underlying data are normally distributed, then the distribution of sample means will also be exactly normal, regardless of the sample size
- if data are categorical, we need at least 10 successes and 10 failures

---

class: middle, center

## Let's run our own simulation

---

### Underlying population (not observed in real life!)

.small[

```r
library(tidyverse)
library(infer)  # provides rep_sample_n(), used below

rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```

]

.pull-left[
<img src="clt-intro_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
**The true population parameters**

```
## # A tibble: 1 × 2
##      mu sigma
##   <dbl> <dbl>
## 1  16.6  14.0
```
]

---

## Sampling from the population - 1

```r
set.seed(1)

samp_1 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_1
```

```
## # A tibble: 1 × 1
##   x_bar
##   <dbl>
## 1  16.3
```

---

## Sampling from the population - 2

```r
set.seed(2)

samp_2 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_2
```

```
## # A tibble: 1 × 1
##   x_bar
##   <dbl>
## 1  13.9
```

---

## Sampling from the population - 3

```r
set.seed(3)

samp_3 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_3
```

```
## # A tibble: 1 × 1
##   x_bar
##   <dbl>
## 1  19.1
```

--

keep repeating...

---

## Sampling distribution

.small[

```r
set.seed(092620)

sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))
```

]

.pull-left[
<img src="clt-intro_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
**The sample statistics**

```
## # A tibble: 1 × 2
##    mean    se
##   <dbl> <dbl>
## 1  16.6  1.98
```
]

---

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

.pull-left[
<img src="clt-intro_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
**The true population parameters**

```
## # A tibble: 1 × 2
##      mu sigma
##   <dbl> <dbl>
## 1  16.6  14.0
```

<br>

**The sample statistics**

```
## # A tibble: 1 × 2
##    mean    se
##   <dbl> <dbl>
## 1  16.6  1.98
```
]

---

## Recap

- If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.

--

- The center of the sampling distribution is at the center of the population distribution.

--

- The sampling distribution is less variable than the population distribution (and we can quantify by how much).

--

.question[
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases? (See the sketch on the next slide.)
]
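
---

## How the standard error changes with `\(n\)`

One way to explore that question in R is shown below: a sketch added here (not part of the original simulation) that plugs the population `\(\sigma = 14.0\)` from our example into the CLT formula `\(SE = \sigma/\sqrt{n}\)` for several sample sizes.

```r
sigma <- 14.0  # population standard deviation of rs_pop

# theoretical standard error of the mean for several sample sizes
tibble(n = c(10, 50, 200, 1000)) %>%
  mutate(se = sigma / sqrt(n))
```

```
## # A tibble: 4 × 2
##       n    se
##   <dbl> <dbl>
## 1    10 4.43 
## 2    50 1.98 
## 3   200 0.990
## 4  1000 0.443
```

The `\(n = 50\)` row reproduces the simulated standard error (1.98) from the previous slides, and quadrupling the sample size halves the SE: the sampling distribution concentrates around `\(\mu\)` as `\(n\)` grows.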
---

class: center, middle

## Using R to calculate probabilities from the Normal distribution

---

## Probabilities under the N(0,1) curve

```r
# P(Z < -1.5)
pnorm(-1.5)
```

```
## [1] 0.0668072
```

<img src="clt-intro_files/figure-html/unnamed-chunk-19-1.png" width="55%" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="clt-intro_files/figure-html/unnamed-chunk-20-1.png" width="55%" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

.pull-left[
<img src="clt-intro_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
**P(Z < 2)**

```r
pnorm(2)
```

```
## [1] 0.9772499
```
]

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

.pull-left[
<img src="clt-intro_files/figure-html/unnamed-chunk-23-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
**P(Z < -1)**

```r
pnorm(-1)
```

```
## [1] 0.1586553
```
]

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

.pull-left[
<img src="clt-intro_files/figure-html/unnamed-chunk-25-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
**P(Z < 2) - P(Z < -1)**

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```
]

---

## Finding cutoff values under the N(0,1) curve

```r
# find Q1 (the 25th percentile)
qnorm(0.25)
```

```
## [1] -0.6744898
```

<img src="clt-intro_files/figure-html/unnamed-chunk-28-1.png" width="55%" style="display: block; margin: auto;" />

---

## Looking ahead...

We will use the Central Limit Theorem and the normal distribution to conduct statistical inference.
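
---

## Preview: a CLT-based interval

As a preview of that inference, here is a sketch (an addition to the deck, reusing our earlier numbers: `x_bar = 16.3` from `samp_1`, the known `\(\sigma = 14.0\)`, and `\(n = 50\)`) of an approximate 95% interval for `\(\mu\)`, `\(\bar{x} \pm 1.96 \times \sigma/\sqrt{n}\)`:

```r
x_bar <- 16.3          # mean of samp_1
sigma <- 14.0          # population sd, assumed known for now
n     <- 50

se <- sigma / sqrt(n)  # standard error of the mean
x_bar + c(-1, 1) * qnorm(0.975) * se
```

```
## [1] 12.41947 20.18053
```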