
The Central Limit Theorem



Introduction to Data Science @ Duke

introds.org

1 / 30

Sample Statistics and Sampling Distributions

2 / 30

Variability of sample statistics

  • We've seen that each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

  • Previously we've quantified this value via simulation

  • Today we talk about some of the theory underlying sampling distributions, particularly as they relate to sample means.

3 / 30

Statistical inference

  • Statistical inference is the act of generalizing from a sample in order to make conclusions regarding a population.

  • We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

  • As part of this process, we must quantify the degree of uncertainty in our sample statistic.

4 / 30

Sampling distribution of the mean

Suppose we’re interested in the mean resting heart rate of students at Duke, and are able to do the following:

  1. Take a random sample of size n from this population, and calculate the mean resting heart rate in this sample, X¯1

  2. Put the sample back, take a second random sample of size n, and calculate the mean resting heart rate from this new sample, X¯2

  3. Put the sample back, take a third random sample of size n, and calculate the mean resting heart rate from this sample, too...

...and so on.

5 / 30

Sampling distribution of the mean

After repeating this many times, we have a data set of sample means from the population: X̄1, X̄2, ..., X̄K (assuming we took K total samples).

Can we say anything about the distribution of these sample means (that is, the sampling distribution of the mean)?

(Keep in mind, we don't know what the underlying distribution of mean resting heart rate of Duke students looks like!)

6 / 30

The Central Limit Theorem

7 / 30

A quick caveat...

For now, let's assume we know the underlying standard deviation, σ, of our population distribution.

8 / 30

The Central Limit Theorem

For a population with a well-defined mean μ and standard deviation σ, these three properties hold for the distribution of sample mean X¯, assuming certain conditions hold:

  1. The mean of the sampling distribution of the mean is identical to the population mean μ.

  2. The standard deviation of the distribution of the sample means is σ/√n.

    • This is called the standard error (SE) of the mean.

  3. For n large enough, the shape of the sampling distribution of the means is approximately normal.
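These three properties can be checked by simulation. Below is a minimal sketch in Python with NumPy (an illustrative assumption on our part; the slides' own code is in R), using a skewed beta population like the one simulated later in the deck:

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed (non-normal) population, similar to the beta population used later
pop = rng.beta(1, 5, size=100_000) * 100
mu, sigma = pop.mean(), pop.std()

# Draw many samples of size n and record each sample mean
n, reps = 50, 5000
xbars = rng.choice(pop, size=(reps, n)).mean(axis=1)

print(xbars.mean(), mu)             # property 1: these should be close
print(xbars.std(), sigma / n**0.5)  # property 2: SE is about sigma / sqrt(n)
# property 3: a histogram of xbars would look approximately normal
```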

9 / 30

The normal (Gaussian) distribution

The normal distribution is unimodal and symmetric and is described by its density function:

If a random variable X follows the normal distribution, then its density is f(x) = (1 / √(2πσ²)) exp{−(x − μ)² / (2σ²)}, where μ is the mean and σ² is the variance (σ is the standard deviation).

We often write N(μ,σ) to describe this distribution.
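As a quick numerical sanity check on the density formula above (a Python sketch; the function name and parameter values are our own, not from the slides), the density should integrate to 1:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, per the formula above."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Riemann sum of the density over a wide grid (about +/- 5.5 sd): should be ~1
mu, sigma, dx = 1.0, 2.0, 0.001
total = sum(normal_pdf(mu - 11 + i * dx, mu, sigma) * dx for i in range(int(22 / dx)))
print(round(total, 4))  # → 1.0
```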

10 / 30

The normal distribution (graphically)

11 / 30

Wait, any distribution?

The central limit theorem tells us that sample means are approximately normally distributed, provided we have enough data and certain assumptions hold.

This is true even if our original variables are not normally distributed.
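For example (a Python sketch of the same idea; the exponential population is our choice, not the slides'): even for a strongly right-skewed population, roughly 95% of sample means land within two standard errors of the population mean, just as the normal shape predicts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Exponential population: mean 1, sd 1, heavily right-skewed
n, reps = 40, 5000
xbars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# Under approximate normality, ~95% of means fall within 2 SEs of mu = 1
se = 1.0 / np.sqrt(n)
coverage = np.mean(np.abs(xbars - 1.0) < 2 * se)
print(round(float(coverage), 3))
```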

12 / 30

Conditions for CLT

We need to check two conditions for the CLT to hold: independence and sample size/distribution.

Independence: The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:

  • the sample must be randomly taken
  • if sampling without replacement, sample size must be less than 10% of the population size

If samples are independent, then by definition one sample's value does not "influence" another sample's value.

13 / 30

Conditions for CLT

Sample size / distribution:

  • if data are numerical, usually n > 30 is considered a large enough sample for the CLT to apply
  • if we know for sure that the underlying data are normally distributed, then the distribution of sample means will also be exactly normal, regardless of the sample size
  • if data are categorical, we need at least 10 successes and 10 failures
14 / 30

Let's run our own simulation

15 / 30

Underlying population (not observed in real life!)

library(tidyverse)   # for tibble() and the pipe

rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)

The true population parameters

## # A tibble: 1 × 2
## mu sigma
## <dbl> <dbl>
## 1 16.6 14.0
16 / 30

Sampling from the population - 1

set.seed(1)
samp_1 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
samp_1
## # A tibble: 1 × 1
## x_bar
## <dbl>
## 1 16.3
17 / 30

Sampling from the population - 2

set.seed(2)
samp_2 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
samp_2
## # A tibble: 1 × 1
## x_bar
## <dbl>
## 1 13.9
18 / 30

Sampling from the population - 3

set.seed(3)
samp_3 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
samp_3
## # A tibble: 1 × 1
## x_bar
## <dbl>
## 1 19.1

keep repeating...

19 / 30

Sampling distribution

library(infer)   # provides rep_sample_n()

set.seed(092620)
sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))

The sample statistics

## # A tibble: 1 × 2
## mean se
## <dbl> <dbl>
## 1 16.6 1.98
20 / 30

How do the shapes, centers, and spreads of these distributions compare?

The true population parameters

## # A tibble: 1 × 2
## mu sigma
## <dbl> <dbl>
## 1 16.6 14.0


The sample statistics

## # A tibble: 1 × 2
## mean se
## <dbl> <dbl>
## 1 16.6 1.98
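The simulated standard error matches what the CLT predicts from the population parameters above: σ/√n with σ = 14.0 and n = 50. A one-line check (plain Python arithmetic, for illustration):

```python
import math

sigma, n = 14.0, 50
print(round(sigma / math.sqrt(n), 2))  # → 1.98, matching the simulated se
```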
21 / 30

Recap

  • If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.

  • The center of the sampling distribution is at the center of the population distribution.

  • The sampling distribution is less variable than the population distribution (and we can quantify by how much).

What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as n increases?
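For reference (our note, not on the slide), one way to reason about the prompt:

```latex
SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
SE \propto \frac{1}{\sqrt{n}}
```

So the spread of the sampling distribution shrinks like 1/√n as n grows; quadrupling the sample size halves the standard error.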

22 / 30

Using R to calculate probabilities from the Normal distribution

23 / 30

Probabilities under N(0,1) curve

# P(Z < -1.5)
pnorm(-1.5)
## [1] 0.0668072

24 / 30

Probability between two values

If Z ~ N(0,1), what is P(−1 < Z < 2)?

25 / 30

Probability between two values

If Z ~ N(0,1), what is P(−1 < Z < 2)?

P(Z < 2)

pnorm(2)
## [1] 0.9772499
26 / 30

Probability between two values

If Z ~ N(0,1), what is P(−1 < Z < 2)?

P(Z < -1)

pnorm(-1)
## [1] 0.1586553
27 / 30

Probability between two values

If Z ~ N(0,1), what is P(−1 < Z < 2)?

P(Z < 2) - P(Z < -1)

pnorm(2) - pnorm(-1)
## [1] 0.8185946
28 / 30

Finding cutoff values under N(0,1) curve

# find Q1
qnorm(0.25)
## [1] -0.6744898

29 / 30

Looking ahead...

We will use the Central Limit Theorem and the normal distribution to conduct statistical inference.

30 / 30
