Therefore we can estimate \mu_X - \mu_Y with the sample means, that is,
\text{Estimate} = \overline{X} - \overline{Y}
The two-sample t-statistic
Since we are testing for a difference in means, we have
\text{Hypothesised value} = \mu_X - \mu_Y
The Estimated Standard Error is the standard deviation of the estimator
\text{e.s.e.} = \text{Standard Deviation of } \overline{X} -\overline{Y}
The two-sample t-statistic
Therefore the two-sample t-statistic is
T = \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\text{e.s.e.}}
Under the Null Hypothesis that \mu_X = \mu_Y, the t-statistic becomes
T = \frac{\overline{X} - \overline{Y} }{\text{e.s.e.}}
A note on the degrees of freedom (df)
The general rule is
\text{df} = \text{Sample size} - \text{No. of estimated parameters}
Sample size in two-sample t-test:
n in the first sample
m in the second sample
Hence total number of observations is n + m
No. of estimated parameters is 2: Namely \mu_X and \mu_Y
Hence the degrees of freedom in the two-sample t-test are
{\rm df} = n + m - 2
(more on this later)
The estimated standard error
Recall: We are assuming populations have same variance
\sigma^2_X = \sigma^2_Y = \sigma^2
We need to compute the estimated standard error
\text{e.s.e.} = \text{Standard Deviation of } \ \overline{X} -\overline{Y}
The variance of the sample mean was computed in the Lemma in Slide 72 of Lecture 2
Since X_1, \ldots, X_n are iid N(\mu_X,\sigma^2) and Y_1, \ldots, Y_m are iid N(\mu_Y,\sigma^2), by the Lemma we get
{\rm Var}[\overline{X}] = \frac{\sigma^2}{n} \,, \qquad \quad
{\rm Var}[\overline{Y}] = \frac{\sigma^2}{m}
The estimated standard error
Since X_i and Y_j are independent we get {\rm Cov}(X_i,Y_j)=0
By bilinearity of covariance we infer
{\rm Cov}( \overline{X} , \overline{Y} ) = \frac{1}{n \cdot m} \sum_{i=1}^n \sum_{j=1}^m {\rm Cov}(X_i,Y_j) = 0
Therefore
{\rm Var}[\overline{X} - \overline{Y}] = {\rm Var}[\overline{X}] + {\rm Var}[\overline{Y}] = \sigma^2 \left( \frac{1}{n} + \frac{1}{m} \right)
From Lecture 2, we know that S_X^2 and S_Y^2 are unbiased estimators of \sigma^2, i.e.
{\rm I\kern-.3em E}[ S_X^2 ] = {\rm I\kern-.3em E}[ S_Y^2 ] = \sigma^2
Therefore, both S_X^2 and S_Y^2 can be used to estimate \sigma^2
Estimating the variance
We can improve the estimate of \sigma^2 by combining S_X^2 and S_Y^2
We will consider a (convex) linear combination
S^2 := \lambda_X S_X^2 + \lambda_Y S_Y^2 \,, \qquad
\lambda_X + \lambda_Y = 1
S^2 is still an unbiased estimator of \sigma^2, since \begin{align*}
{\rm I\kern-.3em E}[S^2] & = {\rm I\kern-.3em E}[ \lambda_X S_X^2 + \lambda_Y S_Y^2 ] \\
& = \lambda_X {\rm I\kern-.3em E}[S_X^2] + \lambda_Y {\rm I\kern-.3em E}[S_Y^2] \\
& = (\lambda_X + \lambda_Y) \sigma^2 \\
& = \sigma^2
\end{align*}
Estimating the variance
We choose coefficients \lambda_X and \lambda_Y which reflect sample sizes
\lambda_X := \frac{n - 1}{n + m - 2} \qquad
\qquad
\lambda_Y := \frac{m - 1}{n + m - 2}
Notes:
We have \lambda_X + \lambda_Y = 1
Denominators in \lambda_X and \lambda_Y are degrees of freedom {\rm df } = n + m - 2
This choice is made so that S^2 has chi-squared distribution (more on this later)
Pooled estimator of variance
Definition
The pooled estimator of \sigma^2 is defined as
S_p^2 := \lambda_X S_X^2 + \lambda_Y S_Y^2
= \frac{(n-1) S_X^2 + (m-1) S_Y^2}{n + m - 2}
Note:
n=m implies \lambda_X = \lambda_Y
In this case S_X^2 and S_Y^2 have same weight in S_p^2
The two-sample t-statistic
The t-statistic currently has the form
T = \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sigma \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
We replace \sigma with the pooled estimator S_p
The two-sample t-statistic
Definition
The two sample t-statistic is defined as
T := \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{ S_p \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
Note: Under the Null Hypothesis that \mu_X = \mu_Y this becomes
T = \frac{\overline{X} - \overline{Y}}{ S_p \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
= \frac{\overline{X} - \overline{Y}}{ \sqrt{ \dfrac{ (n-1) S_X^2 + (m-1) S_Y^2 }{n + m - 2} } \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
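As a small aside, the pooled variance and the resulting t-statistic are easy to compute from first principles. The sketch below assumes two generic numeric vectors x and y (the function name two_sample_t is just an illustrative choice, not part of the lecture code).

# Sketch: pooled variance and two-sample t-statistic under H_0: mu_X = mu_Y
two_sample_t <- function(x, y) {
  n <- length(x)
  m <- length(y)
  sp2 <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)      # Pooled estimator S_p^2
  t_stat <- (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1/n + 1/m))   # Two-sample t-statistic
  list(t = t_stat, df = n + m - 2)
}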
Distribution of two-sample t-statistic
Theorem
The two sample t-statistic has t_{n+m-2} distribution
T := \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{ S_p \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}} \sim t_{n + m - 2}
Distribution of two-sample t-statistic
Proof
We have already seen that \overline{X} - \overline{Y} is normal with
{\rm I\kern-.3em E}[\overline{X} - \overline{Y}] = \mu_X - \mu_Y
\qquad \qquad
{\rm Var}[\overline{X} - \overline{Y}] = \sigma^2 \left( \frac{1}{n} + \frac{1}{m} \right)
Therefore we can rescale \overline{X} - \overline{Y} to get
U := \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{ \sigma \sqrt{ \dfrac{1}{n} + \dfrac{1}{m}}} \sim N(0,1)
Distribution of two-sample t-statistic
Proof
We are assuming X_1, \ldots, X_n iid N(\mu_X,\sigma^2)
Therefore, as already shown, we have
\frac{ (n-1) S_X^2 }{ \sigma^2 } \sim \chi_{n-1}^2
Similarly, since Y_1, \ldots, Y_m iid N(\mu_Y,\sigma^2), we get
\frac{ (m-1) S_Y^2 }{ \sigma^2 } \sim \chi_{m-1}^2
Distribution of two-sample t-statistic
Proof
Since X_i and Y_j are independent, we also have that
\frac{ (n-1) S_X^2 }{ \sigma^2 } \quad \text{ and } \quad
\frac{ (m-1) S_Y^2 }{ \sigma^2 } \quad \text{ are independent}
Interpretation: Reject H_0 when either
p < 0.05 \qquad \text{ or } \qquad t \in \,\,\text{Rejection Region}
\qquad \qquad (T \, \sim \, t_{n+m-2})

| Alternative | Rejection Region | t^* | p-value |
|---|---|---|---|
| \mu_X \neq \mu_Y | \lvert t \rvert > t^* | t_{n+m-2}(0.025) | 2P(T > \lvert t \rvert) |
| \mu_X < \mu_Y | t < -t^* | t_{n+m-2}(0.05) | P(T < t) |
| \mu_X > \mu_Y | t > t^* | t_{n+m-2}(0.05) | P(T > t) |

Reject H_0 if the t-statistic t falls in the Rejection Region. Here T \sim t_{n+m-2}
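The critical values t^* and p-values in the table above can be obtained in R without tables; a minimal sketch, assuming the observed statistic t and the sample sizes n and m are already stored in R.

# Critical values and p-values for the two-sample t-test, df = n + m - 2
df <- n + m - 2
qt(0.975, df)                           # t^* for the two-sided alternative
qt(0.95, df)                            # t^* for the one-sided alternatives
2 * pt(abs(t), df, lower.tail = FALSE)  # p-value for mu_X != mu_Y
pt(t, df)                               # p-value for mu_X < mu_Y
pt(t, df, lower.tail = FALSE)           # p-value for mu_X > mu_Y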
The two-sample t-test in R
General commands
Store the samples x_1,\ldots,x_n and y_1,\ldots,y_m in two R vectors
x <- c(x1, ..., xn)
y <- c(y1, ..., ym)
Perform a two-sample t-test on x and y
| Alternative | R command |
|---|---|
| \mu_X \neq \mu_Y | t.test(x, y, var.equal = T) |
| \mu_X < \mu_Y | t.test(x, y, var.equal = T, alt = "less") |
| \mu_X > \mu_Y | t.test(x, y, var.equal = T, alt = "greater") |
Read output: similar to one-sample t-test
The main quantity of interest is p-value
Comments on command t.test(x, y)
mu = mu0 tells R to test null hypothesis:
H_0 \colon \mu_X - \mu_Y = \mu_0 \qquad \quad (\text{default is } \, \mu_0 = 0)
var.equal = T tells R to assume that populations have same variance
\sigma_X^2 = \sigma^2_Y
In this case R computes the t-statistic with formula discussed earlier
t = \frac{ \overline{x} - \overline{y} }{s_p \sqrt{ \dfrac{1}{n} + \dfrac{1}{m} }}
Comments on command t.test(x, y)
Warning: If var.equal = T is not specified then
R assumes that populations have different variance \sigma_X^2 \neq \sigma^2_Y
In this case the t-statistic
t = \frac{ \overline{x} - \overline{y} }{s_p \sqrt{ \dfrac{1}{n} + \dfrac{1}{m} }}
is NOT t-distributed
R performs the Welch t-test instead of the classic t-test
(more on this later)
Part 6: Two-sample t-test Example
Mathematicians

|       | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9 | x_{10} |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| Wages | 36  | 40  | 46  | 54  | 57  | 58  | 59  | 60  | 62  | 63     |

Accountants

|       | y_1 | y_2 | y_3 | y_4 | y_5 | y_6 | y_7 | y_8 | y_9 | y_{10} | y_{11} | y_{12} | y_{13} |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|--------|--------|--------|
| Wages | 37  | 37  | 42  | 44  | 46  | 48  | 54  | 56  | 59  | 60     | 60     | 64     | 64     |
Samples: Wage data on 10 Mathematicians and 13 Accountants
Wages are independent and normally distributed
Populations have equal variance
Question: Is there evidence of differences in average pay?
Answer: Two-sample two-sided t-test for the hypothesis
H_0 \colon \mu_X = \mu_Y \,,\qquad
H_1 \colon \mu_X \neq \mu_Y
Degrees of freedom are {\rm df} = n + m - 2 = 10 + 13 - 2 = 21
Find corresponding critical value in Table 1
t_{21}(0.025) = 2.08
(Note the value 0.025, since this is two-sided test)
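The same critical value can be obtained in R, without using Table 1:

qt(0.975, df = 21)   # Critical value t_21(0.025), approximately 2.08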
Completing the t-test
Interpretation:
We have that
| t | = 0.464 < 2.08 = t_{21}(0.025)
t falls in the acceptance region
Therefore the p-value satisfies p>0.05
There is no evidence (p>0.05) in favor of H_1
Hence we accept that \mu_X = \mu_Y
Conclusion: Average pay levels seem to be the same for both professions
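For completeness, here is a sketch of the hand calculation carried out in R from first principles; it should reproduce the value t = 0.464 used above.

# First-principles two-sample t-statistic for the Wages data
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)
n <- length(mathematicians)
m <- length(accountants)
sp2 <- ((n - 1) * var(mathematicians) + (m - 1) * var(accountants)) / (n + m - 2)
t <- (mean(mathematicians) - mean(accountants)) / (sqrt(sp2) * sqrt(1/n + 1/m))
t                                # About 0.464
abs(t) > qt(0.975, n + m - 2)    # FALSE: t falls in the acceptance region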
The two-sample t-test in R
This is a two-sided t-test with assumption of equal variance. The p-value is
p = 2 P(t_{n+m-2} > |t|) \,, \qquad t = \frac{\bar{x} - \bar{y} }{s_p \ \sqrt{1/n + 1/m}}
# Enter Wages data in 2 vectors using function c()
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)

# Two-sample t-test with null hypothesis mu_X = mu_Y
# and equal variance assumption. Store result in answer and print.
answer <- t.test(mathematicians, accountants, var.equal = TRUE)
print(answer)
Two Sample t-test
data: mathematicians and accountants
t = 0.46359, df = 21, p-value = 0.6477
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.569496 10.338727
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
First line: R tells us that a Two-Sample t-test is performed
Second line: Data for t-test is mathematicians and accountants
Two Sample t-test
data: mathematicians and accountants
t = 0.46359, df = 21, p-value = 0.6477
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.569496 10.338727
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
Third line:
The t-statistic computed is t = 0.46359
Note: This coincides with the one computed by hand!
There are 21 degrees of freedom
The p-value is p = 0.6477
Two Sample t-test
data: mathematicians and accountants
t = 0.46359, df = 21, p-value = 0.6477
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.569496 10.338727
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
Fourth line: The alternative hypothesis is that the difference in means is not zero
This translates to H_1 \colon \mu_X \neq \mu_Y
Warning: This is not saying to reject H_0 – R is just stating H_1
Two Sample t-test
data: mathematicians and accountants
t = 0.46359, df = 21, p-value = 0.6477
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.569496 10.338727
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
Fifth line: R computes a 95 \% confidence interval for \mu_X - \mu_Y
(\mu_X - \mu_Y) \in [-6.569496, 10.338727]
Interpretation: If you repeat the experiment (on new data) over and over, the interval [a,b] will contain \mu_X - \mu_Y about 95\% of the times
Two Sample t-test
data: mathematicians and accountants
t = 0.46359, df = 21, p-value = 0.6477
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.569496 10.338727
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
Seventh line: R computes sample mean for the two populations
Sample mean for mathematicians is 53.5
Sample mean for accountants is 51.61538
Two Sample t-test
data: mathematicians and accountants
t = 0.46359, df = 21, p-value = 0.6477
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.569496 10.338727
sample estimates:
mean of x mean of y
53.50000 51.61538
Conclusion: The p-value is p = 0.6477
Since p > 0.05 we do not reject H_0
Hence \mu_X and \mu_Y appear to be similar
Average pay levels seem to be the same for both professions
Comment on Assumptions
The previous two-sample t-test was conducted under the following assumptions:
Wages data is normally distributed
The two populations have equal variance
Using R, we can plot the data to see if these are reasonable (graphical exploration)
Warnings: Even if the assumptions hold
we cannot expect the samples to be exactly normal (bell-shaped)
rather, look for approximate normality
we cannot expect the sample variances to match
rather, look for a similar spread in the data
Estimating the sample distribution in R
Suppose given a data sample stored in a vector z
If the sample is large, we can check normality by plotting the histogram of z
Example: z sample of size 1000 from N(0,1) – Its histogram resembles N(0,1)
z <- rnorm(1000)             # Sample 1000 times from N(0,1)
hist(z, probability = TRUE)  # Plot histogram, with area scaled to 1
Drawback: Small samples \implies hard to check normality from histogram
This is true even if the data is normal
Example: z below is sample of size 9 from N(0,1) – But histogram not normal
z <- c(-0.78, -1.67, -0.38, 0.92, -0.58, 0.61, -1.62, -0.06, 0.52)  # Random sample from N(0,1)
hist(z, probability = TRUE)                                         # Histogram does not look normal
Solution: Suppose given an iid sample z from a distribution f
The command density(z) estimates the population distribution f
(The estimate is based on the observed sample z and smoothing – not an easy task)
Example: z as in previous slide. The plot of density(z) shows normal behavior
z <- c(-0.78, -1.67, -0.38, 0.92, -0.58, 0.61, -1.62, -0.06, 0.52)  # Random sample from N(0,1)
dz <- density(z)  # Estimate the density of z
plot(dz)          # Plot the estimated density
The R object density(z) models a 1D function (the estimated distribution of z)
As such, it contains a grid of x values, with associated y values
x values are stored in vector density(z)$x
y values are stored in vector density(z)$y
These values are useful to set the axis range in a plot
dz <- density(z)
plot(dz,                  # Plot dz
     xlim = range(dz$x),  # Set x-axis range
     ylim = range(dz$y))  # Set y-axis range
Axes range set as the min and max values of components of dz
Checking the Assumptions on our Example
# Compute the estimated distributions
d.math <- density(mathematicians)
d.acc <- density(accountants)

# Plot the estimated distributions
plot(d.math,                                     # Plot d.math
     xlim = range(c(d.math$x, d.acc$x)),         # Set x-axis range
     ylim = range(c(d.math$y, d.acc$y)),         # Set y-axis range
     main = "Estimated Distributions of Wages")  # Add title to plot
lines(d.acc,                                     # Layer plot of d.acc
      lty = 2)                                   # Use different line style
legend("topleft",                                # Add legend at top-left
       legend = c("Mathematicians",              # Labels for legend
                  "Accountants"),
       lty = c(1, 2))                            # Assign curves to legend
Axes range set as the min and max values of components of d.math and d.acc
Wages data looks approximately normally distributed (roughly bell-shaped)
The two populations have similar variance (spreads look similar)
Conclusion: Two-sample t-test with equal variance is appropriate \implies accept H_0
Part 7: The Welch t-test
Samples with different variance
We just examined the two-sample t-test
This assumes independent normal populations with equal variance
\sigma_X^2 = \sigma_Y^2
Question: What happens if variances are different?
Answer: Use the Welch Two-sample t-test
This is a generalization of the two-sample t-test to the case \sigma_X^2 \neq \sigma_Y^2
In R it is performed with t.test(x, y)
Note that we are just omitting the option var.equal = TRUE
Equivalently, you may specify var.equal = FALSE
The Welch two-sample t-test
The Welch t-test consists of computing the Welch statistic
w = \frac{\overline{x} - \overline{y}}{ \sqrt{ \dfrac{s_X^2}{n} + \dfrac{s_Y^2}{m} } }
If sample sizes m,n > 5, then w is approximately t-distributed
The degrees of freedom are in general not integers, and depend on S_X, S_Y, n, m
If the variances are similar, the Welch statistic is comparable to the t-statistic
w \approx t : = \frac{ \overline{x} - \overline{y} }{s_p \sqrt{ \dfrac{1}{n} + \dfrac{1}{m} }}
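For reference, the Welch statistic and its approximate degrees of freedom can be computed by hand; the sketch below uses the standard Welch–Satterthwaite formula for the degrees of freedom, which is the approximation R's t.test uses when var.equal = TRUE is not set (the function name welch_stat is just illustrative).

# Sketch: Welch statistic and approximate (non-integer) degrees of freedom
welch_stat <- function(x, y) {
  n <- length(x); m <- length(y)
  vx <- var(x) / n
  vy <- var(y) / m
  w  <- (mean(x) - mean(y)) / sqrt(vx + vy)               # Welch statistic
  df <- (vx + vy)^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))   # Welch-Satterthwaite approximation
  list(w = w, df = df)
}

Applied to the Wages data, this should reproduce the values w = 0.46546 and df = 19.795 reported by t.test later in this section.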
Welch t-test Vs two-sample t-test
If variances are similar:
Welch statistic and t-statistic are similar
p-value from Welch t-test is similar to p-value from two-sample t-test
Since the p-values are similar, the two tests usually yield the same decision
The tests can be used interchangeably
If variances are very different:
Welch statistic and t-statistic are different
p-values from the two tests can differ a lot
The two tests might give different decisions
It is wrong to apply the two-sample t-test, as the variances are different
The Welch two-sample t-test in R
# Enter Wages data
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)

# Perform Welch two-sample t-test with null hypothesis mu_X = mu_Y
# Store result of t.test in answer
answer <- t.test(mathematicians, accountants)

# Print answer
print(answer)
Note:
This is almost the same code as in Slide 86
Only difference: we are omitting the option var.equal = TRUE in t.test
Welch Two Sample t-test
data: mathematicians and accountants
t = 0.46546, df = 19.795, p-value = 0.6467
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.566879 10.336109
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
First line: R tells us that a Welch Two-Sample t-test is performed
The rest of the output is similar to classic t-test
Main difference is that p-value and t-statistic differ from classic t-test
Welch Two Sample t-test
data: mathematicians and accountants
t = 0.46546, df = 19.795, p-value = 0.6467
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.566879 10.336109
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
Third line:
The Welch t-statistic is w = 0.46546 (standard t-test gave t = 0.46359)
The degrees of freedom are fractional: {\rm df} = 19.795 (the standard t-test had {\rm df} = 21)
The Welch t-statistic is approximately t-distributed, W \approx t_{19.795}
Fifth line: The confidence interval for \mu_X - \mu_Y is also different
Welch Two Sample t-test
data: mathematicians and accountants
t = 0.46546, df = 19.795, p-value = 0.6467
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.566879 10.336109
sample estimates:
mean of x mean of y
53.50000 51.61538
Conclusion: The p-values obtained with the 2 tests are almost the same
# Perform two-sample t-test and retrieve p-value
ans <- t.test(trA, trB, alt = "less", var.equal = T)
ans$p.value
[1] 0.05836482
The p-value is p > 0.05
H_0 cannot be rejected \implies There is no evidence that Treatment A is better
Wrong conclusion, because two-sample t-test does not apply
Disclaimer
The previous data was synthetic, and the background story was made up!
Nonetheless, the example is still valid
To construct the data, I sampled as follows
Treatment A: Sample of size 6 from N(-2,1)
Treatment B: Sample of size 12 from N(-1.5,9)
We see that
\mu_A < \mu_B \,, \qquad
\sigma_A^2 \neq \sigma_B^2
This tells us that:
We can expect that some samples will support that \mu_A < \mu_B
Two-sample t-test is inappropriate because \sigma_A^2 \neq \sigma_B^2
Generating the Data
The code used to generate the data is shown below
# Set seed for random generation
# This way you always get the same random numbers when
# you run this code
set.seed(21)

repeat {
  # Generate random samples
  x <- rnorm(6, mean = -2, sd = 1)
  y <- rnorm(12, mean = -1.5, sd = 3)

  # Round x and y to 1 decimal point
  x <- round(x, 1)
  y <- round(y, 1)

  # Perform one-sided t-tests for alternative hypothesis mu_x < mu_y
  ans_welch <- t.test(x, y, alt = "less", var.equal = F)
  ans_t_test <- t.test(x, y, alt = "less", var.equal = T)

  # Check that Welch test succeeds and two-sample test fails
  if (ans_welch$p.value < 0.05 && ans_t_test$p.value > 0.05) {
    cat("Data successfully generated!!!\n\n")
    cat("Synthetic Data TrA:", x, "\n")
    cat("Synthetic Data TrB:", y, "\n\n")
    cat("Welch t-test p-value:", ans_welch$p.value, "\n")
    cat("Two-sample t-test p-value:", ans_t_test$p.value)
    break
  }
}
Bottom line: The data is paired, therefore a paired t-test must be used
Part 1: The F-distribution
Recall
The chi-squared distribution with p degrees of freedom is
\chi_p^2 = Z_1^2 + \ldots + Z_p^2 \qquad \text{where} \qquad Z_1, \ldots, Z_p \,\,\, \text{iid} \,\,\, N(0, 1)
Chi-squared distribution was used to:
Describe distribution of sample variance S^2:
\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2
Define t-distribution with p degrees of freedom:
\frac{U}{\sqrt{V/p}} \sim t_p \qquad \text{where} \qquad
U \sim N(0,1) \,, \quad V \sim \chi_p^2 \quad \text{ independent}
The F-distribution
Definition
The r.v. F has F-distribution with p and q degrees of freedom if the pdf is
f_F(x) = \frac{ \Gamma \left(\frac{p+q}{2} \right) }{ \Gamma \left( \frac{p}{2} \right) \Gamma \left( \frac{q}{2} \right) }
\left( \frac{p}{q} \right)^{p/2} \,
\frac{ x^{ (p/2) - 1 } }{ [ 1 + (p/q) x ]^{(p+q)/2} } \,, \quad x > 0
Notation: F-distribution with p and q degrees of freedom is denoted by F_{p,q}
Remark: Used to describe ratios of variance estimators for independent samples
Plot of F-distributions
Characterization of F-distribution
The F-distribution is obtained as the ratio of 2 independent chi-squared random variables, each divided by its degrees of freedom
Theorem
Suppose that U \sim \chi_p^2 and V \sim \chi_q^2 are independent. Then
X := \frac{U/p}{V/q} \sim F_{p,q}
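The characterization can be checked numerically with a quick simulation; a sketch with arbitrarily chosen degrees of freedom p and q.

# Simulate X = (U/p)/(V/q) with U ~ chi^2_p and V ~ chi^2_q independent,
# and compare with the theoretical F_{p,q} quantiles
set.seed(1)
p <- 5; q <- 10
U <- rchisq(10000, df = p)
V <- rchisq(10000, df = q)
X <- (U / p) / (V / q)
quantile(X, c(0.5, 0.9, 0.95))    # Empirical quantiles
qf(c(0.5, 0.9, 0.95), p, q)       # Theoretical F_{p,q} quantiles (should be close)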
Idea of Proof
This is similar to the proof (seen in Homework 2) that
\frac{U}{\sqrt{V/p}} \sim t_p
where U \sim N(0,1) and V \sim \chi_p^2 are independent
In our case we need to prove
X := \frac{U/p}{V/q} \sim F_{p,q}
where U \sim \chi_p^2 and V \sim \chi_q^2 are independent
Consider the change of variables
x(u,v) := \frac{u/p}{v/q} \,, \quad y(u,v) := u + v
Idea of Proof
This way we have
X = \frac{U/p}{V/q} \,, \qquad Y = U + V
To conclude the proof, we need to compute the pdf of X, that is f_X
This can be computed as the X marginal of f_{X,Y}
f_{X}(x) = \int_{0}^\infty f_{X,Y}(x,y) \, dy
Idea of Proof
The joint pdf f_{X,Y} can be computed by inverting the change of variables
x(u,v) := \frac{u/p}{v/q} \,, \quad y(u,v) := u + v
and using the formula
f_{X,Y}(x,y) = f_{U,V}(u(x,y),v(x,y)) \, |\det J|
where J is the Jacobian of the inverse transformation
(x,y) \mapsto (u(x,y),v(x,y))
Idea of Proof
Since f_{U,V} is known, then also f_{X,Y} is known
Moreover the integral
f_{X}(x) = \int_{0}^\infty f_{X,Y}(x,y) \, dy
can be computed explicitly, yielding the claimed pdf
f_{X}(x) = \frac{ \Gamma \left(\frac{p+q}{2} \right) }{ \Gamma \left( \frac{p}{2} \right) \Gamma \left( \frac{q}{2} \right) }
\left( \frac{p}{q} \right)^{p/2} \,
\frac{ x^{ (p/2) - 1 } }{ [ 1 + (p/q) x ]^{(p+q)/2} }
Properties of F-distribution
Theorem
Suppose X \sim F_{p,q} with q>2. Then
{\rm I\kern-.3em E}[X] = \frac{q}{q-2}
If X \sim F_{p,q} then 1/X \sim F_{q,p}
If X \sim t_q then X^2 \sim F_{1,q}
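These properties are easy to check numerically in R; a small sketch with arbitrarily chosen p and q.

p <- 3; q <- 8
q / (q - 2)                          # Theoretical mean of F_{p,q}
mean(rf(100000, p, q))               # Simulated mean (should be close to q/(q-2))
# If X ~ t_q then X^2 ~ F_{1,q}: compare the two CDFs at an arbitrary point c0
c0 <- 2.5
pf(c0, 1, q)                         # P(X^2 <= c0) with X^2 ~ F_{1,q}
pt(sqrt(c0), q) - pt(-sqrt(c0), q)   # P(-sqrt(c0) <= X <= sqrt(c0)) with X ~ t_q; same value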
Properties of F-distribution
Proof of Theorem
Requires a bit of work (omitted)
By the Theorem in Slide 6, we have
X \sim F_{p,q} \quad \implies \quad X = \frac{U/p}{V/q}
with U \sim \chi_p^2 and V \sim \chi_q^2 independent. Therefore
\frac{1}{X} = \frac{V/q}{U/p} \sim \frac{\chi^2_q/q}{\chi^2_p/p} \sim F_{q,p}
Properties of F-distribution
Proof of Theorem
Suppose X \sim t_q. The Theorem in Slide 118 of Lecture 2, guarantees that
X = \frac{U}{\sqrt{V/q}}
where U \sim N(0,1) and V \sim \chi_q^2 are independent. Therefore
X^2 = \frac{U^2}{V/q}
Properties of F-distribution
Proof of Theorem
Since U \sim N(0,1), by definition U^2 \sim \chi_1^2.
Moreover U^2 and V are independent, since U and V are independent
Finally, the Theorem in Slide 6 implies
X^2 = \frac{U^2}{V/q} \sim \frac{\chi_1^2/1}{\chi_q^2/q} \sim F_{1,q}
Part 2: Two-sample F-test
Variance estimators
Suppose given independent random samples from 2 normal populations:
X_1, \ldots, X_n iid random sample from N(\mu_X, \sigma_X^2)
Y_1, \ldots, Y_m iid random sample from N(\mu_Y, \sigma_Y^2)
Problem:
We want to compare variance of the 2 populations
We do it by studying the variances ratio
\frac{\sigma_X^2}{\sigma_Y^2}
Variance estimators
Question:
Suppose the variances \sigma_X^2 and \sigma_Y^2 are unknown
How can we estimate the ratio \sigma_X^2 /\sigma_Y^2 \, ?
Answer:
Estimate the ratio \sigma_X^2 /\sigma_Y^2 \, using sample variances
S^2_X / S^2_Y
The F-distribution allows us to compare the quantities
\sigma_X^2 /\sigma_Y^2 \qquad \text{and} \qquad S^2_X / S^2_Y
Variance ratio distribution
Theorem
Suppose given independent random samples from 2 normal populations:
X_1, \ldots, X_n iid random sample from N(\mu_X, \sigma_X^2)
Y_1, \ldots, Y_m iid random sample from N(\mu_Y, \sigma_Y^2)
The random variable
F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } \, \sim \, F_{n-1,m-1}
that is, F is F-distributed with n-1 and m-1 degrees of freedom
Variance ratio distribution
Proof
We need to prove
F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } \sim F_{n-1,m-1}
By the Theorem in Slide 101 Lecture 2, we have that
\frac{S_X^2}{ \sigma_X^2} \sim \frac{\chi_{n-1}^2}{n-1} \,, \qquad
\frac{S_Y^2}{ \sigma_Y^2} \sim \frac{\chi_{m-1}^2}{m-1}
Variance ratio distribution
Proof
Therefore
F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } = \frac{U/p}{V/q}
where we have
U \sim \chi_{p}^2 \,, \qquad
V \sim \chi_q^2 \,, \qquad
p = n-1 \,, \qquad
q = m - 1
By the Theorem in Slide 6, we infer the claim
F = \frac{U/p}{V/q} \sim F_{n-1,m-1}
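The Theorem can also be verified by simulation; a rough sketch with arbitrarily chosen sample sizes and variances.

# Simulate the ratio (S_X^2/sigma_X^2)/(S_Y^2/sigma_Y^2) for normal samples
set.seed(2)
n <- 10; m <- 13
sigma_X <- 2; sigma_Y <- 5
ratios <- replicate(10000, {
  x <- rnorm(n, mean = 0, sd = sigma_X)
  y <- rnorm(m, mean = 0, sd = sigma_Y)
  (var(x) / sigma_X^2) / (var(y) / sigma_Y^2)
})
quantile(ratios, c(0.5, 0.95))    # Empirical quantiles
qf(c(0.5, 0.95), n - 1, m - 1)    # Theoretical F_{n-1, m-1} quantiles (should be close)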
Unbiased estimation of variance ratio
Question: Why is S_X^2/S_Y^2 a good estimator for \sigma_X^2/\sigma_Y^2?
Answer:
Because S_X^2/S_Y^2 is an (asymptotically) unbiased estimator of \sigma_X^2/\sigma_Y^2
This is shown in the following Theorem
Unbiased estimation of variance ratio
Theorem
Suppose given independent random samples from 2 normal populations:
X_1, \ldots, X_n iid random sample from N(\mu_X, \sigma_X^2)
Y_1, \ldots, Y_m iid random sample from N(\mu_Y, \sigma_Y^2)
Then S_X^2/S_Y^2 is an asymptotically unbiased estimator of the variance ratio, in the sense that
{\rm I\kern-.3em E} \left[ \frac{S_X^2}{S_Y^2} \right] \, \longrightarrow \, \frac{\sigma_X^2}{\sigma_Y^2} \qquad \text{as} \quad m \to \infty
The two-sample F-test
The hypotheses for difference in variance are
H_0 \colon \sigma_X^2 = \sigma_Y^2 \,, \quad \qquad
H_1 \colon \sigma_X^2 \neq \sigma_Y^2 \quad \text{ or } \quad
H_1 \colon \sigma_X^2 > \sigma_Y^2
Very important: We only consider two-sided and right-tailed hypotheses
This is because we can always label the samples in a way that s_X^2 \geq s_Y^2
Therefore, there is no reason to suspect that \sigma_X^2 < \sigma_Y^2
This allows us to work only with upper quantiles
(and avoid a lot of trouble, as the F-distribution is asymmetric)
Procedure: 3 Steps
Calculation: Compute the two-sample F-statistic
F = \frac{ s_X^2}{ s_Y^2}
where sample variances are
s_X^2 = \frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1}
\qquad \quad
s_Y^2 = \frac{\sum_{i=1}^m y_i^2 - m \overline{y}^2}{m-1}
Very important: s_X^2 refers to the sample with the larger variance
\implies \quad s_X^2 \geq s_Y^2 \,, \qquad F \geq 1
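A minimal sketch of this calculation in R, assuming the two samples are stored in generic vectors x and y; the computational formula gives the same value as the built-in var().

# Sample variances via the computational formula, and the F-statistic
n <- length(x); m <- length(y)
s2_x <- (sum(x^2) - n * mean(x)^2) / (n - 1)   # Same value as var(x)
s2_y <- (sum(y^2) - m * mean(y)^2) / (m - 1)   # Same value as var(y)
F_stat <- max(s2_x, s2_y) / min(s2_x, s2_y)    # Larger variance on top, so F_stat >= 1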
Table 4: Lists the values F_{\nu_1,\nu_2}(0.025), which means
P(X > F_{\nu_1,\nu_2}(0.025)) = 0.025 \,, \qquad \text{ when } \quad X \sim F_{\nu_1,\nu_2}
For example F_{9, 6}(0.025) = 5.52
What about missing values?
Sometimes the value F_{\nu_1,\nu_2}(\alpha) is missing from the F-table
In such cases, approximate F_{\nu_1,\nu_2}(\alpha) with the average of the closest available entries
Example:F_{21,5}(0.05) is missing. We can approximate it by
F_{21,5}(0.05) \approx \frac{F_{20,5}(0.05) + F_{25,5}(0.05)}{2} = \frac{ 4.56 + 4.52 }{ 2 } = 4.54
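R gives the exact quantiles directly with qf(), which can be used to check the interpolation:

qf(0.95, df1 = 21, df2 = 5)    # Exact F_{21,5}(0.05); should be close to the interpolated 4.54
qf(0.975, df1 = 9, df2 = 6)    # Exact F_{9,6}(0.025); about 5.52, as in the earlier example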
The two-sample F-test in R
Store the samples x_1,\ldots,x_n and y_1,\ldots,y_m in two R vectors
x <- c(x1, ..., xn)
y <- c(y1, ..., ym)
Perform a two-sample F-test on x and y
| Alternative | R command |
|---|---|
| \sigma_X^2 \neq \sigma_Y^2 | var.test(x, y) |
| \sigma_X^2 > \sigma_Y^2 | var.test(x, y, alt = "greater") |
Read output: similar to two-sample t-test
The main quantity of interest is p-value
Part 3: Worked Example
Data: Wages of 10 Mathematicians and 13 Accountants (again!)
Assumptions: Wages are independent and normally distributed
Mathematicians (Wages): 36, 40, 46, 54, 57, 58, 59, 60, 62, 63

Accountants (Wages): 37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64
Last week, we conducted a two-sample t-test for equality of means
H_0 \colon \mu_X = \mu_Y \,, \qquad
H_1 \colon \mu_X \neq \mu_Y
We concluded that there is no evidence (p>0.05) of difference in pay levels
However, the two-sample t-test assumed equal variance
\sigma_X^2 = \sigma_Y^2
We checked this assumption by plotting the estimated sample distributions
View the R Code
# Enter the Wages data
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)

# Compute the estimated distributions
d.math <- density(mathematicians)
d.acc <- density(accountants)

# Plot the estimated distributions
plot(d.math,                                     # Plot d.math
     xlim = range(c(d.math$x, d.acc$x)),         # Set x-axis range
     ylim = range(c(d.math$y, d.acc$y)),         # Set y-axis range
     main = "Estimated Distributions of Wages")  # Add title to plot
lines(d.acc,                                     # Layer plot of d.acc
      lty = 2)                                   # Use different line style
legend("topleft",                                # Add legend at top-left
       legend = c("Mathematicians",              # Labels for legend
                  "Accountants"),
       lty = c(1, 2))                            # Assign curves to legend
The plot shows that the two populations have similar variance (spread)
Thanks to the F-test, we can now compare variances in a quantitative way
Setting up the F-test
We want to test for equality of the two variances
There is no prior reason to believe that pay in one group is more spread out
Therefore, a two-sided test is appropriate
H_0 \colon \sigma_X^2 = \sigma_Y^2 \,, \qquad
H_1 \colon \sigma_X^2 \neq \sigma_Y^2
First task is to compute the F-statistic
F = \frac{s_X^2}{s_Y^2}
Important: We need to make sure the samples are labelled so that
s_X^2 \geq s_Y^2
1. Calculation
Variance of first sample (Already done last week!)
There is no evidence (p > 0.05) in favor of H_1. We have no reason to doubt that
\sigma_X^2 = \sigma_Y^2
Conclusion:
Wage levels for the two groups appear to be equally well spread out
This is in line with previous graphical checks
The F-test in R
We present two implementations in R:
Simple solution using the command var.test
A first-principles construction closer to our earlier hand calculation
Simple solution: var.test
This is a two-sided F-test. The p-value is computed by
p = 2 P(F_{m-1, n-1} > F) \,, \qquad F = \frac{s_Y^2}{s_X^2}
# Enter Wages data in 2 vectors using function c()
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)

# Perform two-sided F-test using var.test
# Store result and print
ans <- var.test(accountants, mathematicians)
print(ans)
Note: accountants goes first because it has larger variance
F test to compare two variances
data: accountants and mathematicians
F = 1.0607, num df = 12, denom df = 9, p-value = 0.9505
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2742053 3.6443547
sample estimates:
ratio of variances
1.060686
Comments:
First line: R tells us that an F-test is performed
Second line: Data for F-test is accountants and mathematicians
The F-statistic computed is F = 1.0607
Note: This coincides with the one computed by hand (up to rounding error)
F test to compare two variances
data: accountants and mathematicians
F = 1.0607, num df = 12, denom df = 9, p-value = 0.9505
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2742053 3.6443547
sample estimates:
ratio of variances
1.060686
Comments:
Numerator of F-statistic has 12 degrees of freedom
Denominator of F-statistic has 9 degrees of freedom
p-value is p = 0.9505
F test to compare two variances
data: accountants and mathematicians
F = 1.0607, num df = 12, denom df = 9, p-value = 0.9505
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2742053 3.6443547
sample estimates:
ratio of variances
1.060686
Comments:
Fourth line: The alternative hypothesis is that ratio of variances is \, \neq 1
This translates to H_1 \colon \sigma_Y^2 \neq \sigma^2_X
Warning: This is not saying to reject H_0 – R is just stating H_1
F test to compare two variances
data: accountants and mathematicians
F = 1.0607, num df = 12, denom df = 9, p-value = 0.9505
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2742053 3.6443547
sample estimates:
ratio of variances
1.060686
Comments:
Fifth line: R computes a 95 \% confidence interval for ratio \sigma_Y^2/\sigma_X^2
(\sigma_Y^2/\sigma_X^2 ) \in [0.2742053, 3.6443547]
Interpretation: If you repeat the experiment (on new data) over and over, the interval will contain \sigma_Y^2/\sigma_X^2 about 95\% of the times
F test to compare two variances
data: accountants and mathematicians
F = 1.0607, num df = 12, denom df = 9, p-value = 0.9505
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2742053 3.6443547
sample estimates:
ratio of variances
1.060686
Comments:
Seventh line: R computes ratio of sample variances
We have that s_Y^2/s_X^2 = 1.060686
By definition, the above coincides with the F-statistic (up to rounding)
F test to compare two variances
data: accountants and mathematicians
F = 1.0607, num df = 12, denom df = 9, p-value = 0.9505
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2742053 3.6443547
sample estimates:
ratio of variances
1.060686
Conclusion: The p-value is p = 0.9505
Since p > 0.05, we do not reject H_0
Hence \sigma^2_X and \sigma^2_Y appear to be similar
Wage levels for the two groups appear to be equally well spread out
First principles solution
Start by entering data into R
# Enter Wages data in 2 vectors using function c()
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)
First principles solution
Check which population has higher variance
In our case accountants has higher variance
# Check which variance is higher
cat("\n Variance of accountants is", var(accountants))
cat("\n Variance of mathematicians is", var(mathematicians))
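Continuing the first-principles route, the F-statistic and the two-sided p-value can be computed directly; this sketch should reproduce the var.test output above (F = 1.0607, p = 0.9505).

# First-principles F-test: accountants has the larger variance, so it goes on top
F_stat <- var(accountants) / var(mathematicians)
df1 <- length(accountants) - 1     # 12 degrees of freedom (numerator)
df2 <- length(mathematicians) - 1  # 9 degrees of freedom (denominator)
p <- 2 * pf(F_stat, df1, df2, lower.tail = FALSE)  # Two-sided p-value
cat("\n F-statistic:", F_stat, "\n p-value:", p)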