Lecture 5
Data: in the form of numerical counts
Test: difference between observed counts and predictions of theoretical model
Example: Blood types
We conducted blood type testing on a sample of 6004 individuals, and the results are summarized below.
| A | B | AB | O |
|---|---|---|---|
| 2162 | 738 | 228 | 2876 |
We want to compare the above data to the theoretical probability model
| A | B | AB | O |
|---|---|---|---|
| 1/3 | 1/8 | 1/24 | 1/2 |
Example: Relative performance of Manchester United managers
| Manager | Won | Drawn | Lost |
|---|---|---|---|
| Moyes | 27 | 9 | 15 |
| Van Gaal | 54 | 25 | 24 |
| Mourinho | 84 | 32 | 28 |
| Solskjaer | 91 | 37 | 40 |
| Rangnick | 11 | 10 | 8 |
| ten Hag | 61 | 12 | 28 |
Questions: Did all the managers perform equally well? Is the difference in their records significant?

General setting: We observe counts O_1, \ldots, O_n for n types of outcome

| Type 1 | Type 2 | \ldots | Type n |
|---|---|---|---|
| O_1 | O_2 | \ldots | O_n |

A theoretical model predicts the expected counts E_1, \ldots, E_n

| Type 1 | Type 2 | \ldots | Type n |
|---|---|---|---|
| E_1 | E_2 | \ldots | E_n |
Goal: Compare expected counts E_i with observed counts O_i
Null hypothesis: The expected counts match the observed counts
H_0 \colon O_i = E_i \,, \qquad \forall \, i = 1, \ldots, n
Method: Look for evidence against the null hypothesis
Distance between observed counts and expected counts is large
For example, if (O_i - E_i)^2 \geq c for some chosen constant c
To quantify this distance, define the chi-squared statistic \chi^2 := \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}

Remark: \chi^2 represents the global distance between observed and expected counts

Indeed, we have that \chi^2 = 0 \qquad \iff \qquad O_i = E_i \,\,\,\, \text{ for all } \,\,\,\, i = 1, \, \ldots, \, n
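For instance, with illustrative observed and expected counts (not taken from the examples in these notes), this global distance can be computed directly in R:

```r
# Illustrative observed and expected counts
O <- c(30, 20, 50)
E <- c(25, 25, 50)

# Chi-squared statistic: global distance between O and E
chi2 <- sum((O - E)^2 / E)
chi2  # 5^2/25 + (-5)^2/25 + 0 = 2

# Identical counts give distance 0
sum((O - O)^2 / O)  # 0
```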
Remark: The multinomial distribution \mathop{\mathrm{Mult}}(m, p_1, \ldots, p_n) models the following experiment
| Outcome type | 1 | \ldots | n | Total |
|---|---|---|---|---|
| Counts | X_1 | \ldots | X_n | X_1 + \ldots + X_n = m |
| Probabilities | p_1 | \ldots | p_n | p_1 + \ldots + p_n = 1 |
For n = 2, the multinomial reduces to a binomial:
| Outcome types | 1 | 2 |
|---|---|---|
| Counts | X_1 | X_2 = m - X_1 |
| Probabilities | p_1 | p_2 = 1 - p_1 |
Suppose that \mathbf{X}= (X_1, \ldots, X_n) \sim \mathop{\mathrm{Mult}}(m,p_1,\ldots,p_n)
If we are only interested in outcome i, the remaining outcomes are failures
This means X_i is binomial with m trials and success probability p_i
We write X_i \sim \mathop{\mathrm{Bin}}(m,p_i), with pmf f(x_i) = P(X_i = x_i) = \frac{m!}{x_i! \, (m - x_i)!} \, p_i^{x_i} (1-p_i)^{m - x_i} \qquad \forall \, x_i = 0, \ldots, m
Since X_i \sim \mathop{\mathrm{Bin}}(m,p_i) it holds {\rm I\kern-.3em E}[X_i] = m p_i \qquad \qquad {\rm Var}[X_i] = m p_i (1-p_i)
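These formulas can be checked by simulation; the following sketch uses illustrative values of m and p_i:

```r
set.seed(1)
m <- 1000
p <- c(0.5, 0.3, 0.2)

# 10000 multinomial draws; each column is one realisation of (X_1, X_2, X_3)
draws <- rmultinom(10000, size = m, prob = p)

# Empirical means of each X_i: close to m * p_i
rowMeans(draws)

# Empirical variances of each X_i: close to m * p_i * (1 - p_i)
apply(draws, 1, var)
```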
O_i refers to observed count of category i
E_i refers to expected count of category i
We suppose that Type i is observed with probability p_i and 0 \leq p_i \leq 1 \,, \qquad \quad p_1 + \ldots + p_n = 1
Total number of observations is m
The counts are modelled by (O_1, \ldots, O_n) \sim \mathop{\mathrm{Mult}}(m, p_1, \ldots, p_n)
The expected counts are modelled by E_i := {\rm I\kern-.3em E}[ O_i ] = m p_i
Consider counts and expected counts (O_1, \ldots, O_n) \sim \mathop{\mathrm{Mult}}(m, p_1, \ldots, p_n) \qquad \qquad E_i := m p_i
Question: What is the distribution of \chi^2 \,?
Since O_i \sim \mathop{\mathrm{Bin}}(m, p_i), the Central Limit Theorem implies \frac{O_i - {\rm I\kern-.3em E}[O_i]}{ \sqrt{{\rm Var}[O_i] } } = \frac{O_i - m p_i }{ \sqrt{m p_i(1 - p_i) } } \ \stackrel{{\rm d}}{\longrightarrow} \ N(0,1) as m \to \infty
In particular, since (1-p_i) is constant, we have \frac{O_i - m p_i }{ \sqrt{m p_i } } \ \approx \ \frac{O_i - m p_i }{ \sqrt{m p_i(1 - p_i) } } \ \approx \ N(0,1)
Squaring the previous expression, we get \frac{(O_i - m p_i)^2 }{ m p_i } \ \approx \ N(0,1)^2 = \chi_1^2
If the above random variables were pairwise independent, we would obtain \chi^2 = \sum_{i=1}^n \frac{(O_i - m p_i)^2 }{ m p_i } \ \approx \ \sum_{i=1}^n \chi_1^2 = \chi_n^2
However the O_i are not independent, because of the linear constraint O_1 + \ldots + O_n = m (total counts have to sum to m)
A priori, there should be n degrees of freedom
However, the linear constraint reduces degrees of freedom by 1
Thus, we have n-1 degrees of freedom, implying that \chi^2 = \sum_{i=1}^n \frac{(O_i - m p_i)^2 }{ m p_i } \ \approx \ \chi_{n-1}^2
This is not a proof! The actual proof is super technical
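The approximation can nevertheless be checked numerically: the sketch below (with illustrative values of n, m and uniform p_i) simulates the statistic under H_0 and compares its mean to that of \chi_{n-1}^2, which is n - 1:

```r
set.seed(42)
n <- 4
m <- 1000
p <- rep(1/n, n)

# Simulate the chi-squared statistic under the null many times
stats <- replicate(5000, {
  O <- rmultinom(1, size = m, prob = p)
  sum((O - m * p)^2 / (m * p))
})

# Should be close to n - 1 = 3, the mean of chi^2_{n-1}
mean(stats)
```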
Define expected counts E_i := m p_i
Consider approximation from Theorem: \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i } \ \approx \ \chi_{n-1}^2
The quality of the approximation:

| Approximation | Condition |
|---|---|
| Good | E_i \geq 5 \, \text{ for all } \, i = 1, \ldots, n |
| Bad | E_i < 5 \, \text{ for some } \, i = 1, \ldots, n |
How to compute the p-value: when the approximation is good, use the \chi_{n-1}^2 distribution; when it is bad, use Monte Carlo simulation
Setting: Counts of n types of outcome, where type i occurs with probability p_i

Problem: p_i is unknown and needs to be estimated
Hypothesis: As guess for p_1,\ldots,p_n, we take p_1^0, \ldots, p_n^0 such that 0 \leq p_i^0 \leq 1 \qquad \qquad \sum_{i=1}^n p_i^0 = 1
Formal Hypothesis: We test for equality of p_i to p_i^0 \begin{align*} & H_0 \colon p_i = p_i^0 \qquad \text{ for all } \, i = 1, \ldots, n \\ & H_1 \colon p_i \neq p_i^0 \qquad \text{ for at least one } \, i \end{align*}
Sample: m observations, classified into the n types

Data: Vector of counts (o_1, \ldots, o_n)
Schematically: We can represent probabilities and counts in a table
| Type | 1 | \ldots | n | Total |
|---|---|---|---|---|
| Counts | o_1 | \ldots | o_n | m |
| Probabilities | p_1 | \ldots | p_n | 1 |
| Alternative | Rejection Region | p-value |
|---|---|---|
| \exists \, i \,\, s.t. \,\, p_i \neq p_i^0 | \chi^2 > \chi^2_{n-1}(0.05) | P(\chi_{n-1}^2 > \chi^2) |
Reject H_0 if the \chi^2 statistic falls in the Rejection Region
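The critical value \chi^2_{n-1}(0.05) and the p-value in the table can be computed with R's chi-squared quantile and distribution functions; the values of n and the statistic below are purely illustrative:

```r
n <- 4       # number of categories (illustrative)
chi2 <- 10   # illustrative value of the chi-squared statistic

# Critical value chi^2_{n-1}(0.05): upper 5% quantile of chi^2_{n-1}
qchisq(0.95, df = n - 1)

# p-value: P(chi^2_{n-1} > chi2)
1 - pchisq(chi2, df = n - 1)
```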
In R, store the counts and the null probabilities

```r
counts <- c(o1, ..., on)
null.p <- c(p1, ..., pn)
```

and test counts against null.p

| Alternative | R command |
|---|---|
| \exists \, i \,\, s.t. \,\, p_i \neq p_i^0 | chisq.test(counts, p = null.p) |
By default, R computes p-value via the \chi_{n-1}^2 approximation
R will warn you if expected counts are too low, and tell you to use Monte Carlo
To compute the p-value via Monte Carlo simulation, use the option simulate.p.value = TRUE

Example: Suppose we have counts of blood types for a sample of people
We also have theoretical probabilities that we want to test
| Blood type | A | B | AB | O |
|---|---|---|---|---|
| Count | 2162 | 738 | 228 | 2876 |
| Probability | 1/3 | 1/8 | 1/24 | 1/2 |
Hence the null hypothesis probabilities are p_1^0 = \frac{1}{3} \qquad \quad p_2^0 = \frac{1}{8} \qquad \quad p_3^0 = \frac{1}{24} \qquad \quad p_4^0 = \frac{1}{2}
The alternative hypothesis is: \qquad H_1 \colon p_i \neq p_i^0 \,\, \text{ for at least one } \, i
We use R to perform a goodness-of-fit test on alternative H_1 \colon p_i \neq p_i^0 \qquad \text{ for at least one } \, i
The code can be downloaded here good_fit.R
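The code in good_fit.R presumably looks like the following sketch, with counts and probabilities taken from the tables above:

```r
# Blood type counts and null hypothesis probabilities
counts <- c(2162, 738, 228, 2876)
null.p <- c(1/3, 1/8, 1/24, 1/2)

# Goodness-of-fit test against the theoretical probabilities
chisq.test(counts, p = null.p)
```

Running this produces the output below.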
Chi-squared test for given probabilities
data: counts
X-squared = 20.359, df = 3, p-value = 0.000143
Chi-squared statistic is \, \chi^2 = 20.359
Degrees of freedom are \, {\rm df} = 3
p-value is p \approx 0
Therefore p < 0.05
We reject H_0
Conclusion: At least one of the theoretical probabilities appears to be wrong
Example: Testing whether a die is fair. We roll a die 100 times and record the counts

| Outcome | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Count | 13 | 17 | 9 | 17 | 18 | 26 |

The null hypothesis is that the die is fair: P(\text{rolling }\, i) = \frac16 \,, \quad \forall \, i = 1, \ldots, 6
```r
# Enter counts and null hypothesis probabilities
counts <- c(13, 17, 9, 17, 18, 26)
null.p <- rep(1/6, 6)

# Perform goodness-of-fit test
chisq.test(counts, p = null.p)
```

Note that each type is equally likely. Therefore, we can achieve the same result without specifying the null probabilities
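Concretely, re-using the counts above, the shorter call is a sketch like:

```r
counts <- c(13, 17, 9, 17, 18, 26)

# With p omitted, chisq.test defaults to equal probabilities 1/n
chisq.test(counts)
```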
Chi-squared test for given probabilities
data: counts
X-squared = 9.68, df = 5, p-value = 0.08483
Chi-squared statistic is \, \chi^2 = 9.68
Degrees of freedom are \, {\rm df} = 5
p-value is p \approx 0.08
Therefore p > 0.05
We cannot reject H_0
Conclusion: The die appears to be fair
Example: A sample of 100 voters is polled, with the following results

| Republican | Democrat | Undecided |
|---|---|---|
| 35 | 40 | 25 |
Historical data suggest that undecided voters are 30\% of the population

Exercise: Is the difference between Republicans and Democrats significant?

Solution: Since historical data suggest that 30\% of voters are undecided, we set p_3^0 = 0.3
We want to test whether there is a difference between Republicans and Democrats
Hence null hypothesis is p_1^0 = p_2^0
Since probabilities must sum to 1, we get p_1^0 + p_2^0 + p_3^0 = 1 \quad \implies \quad p_1^0 = 0.35\,, \qquad p_2^0 = 0.35 \,, \qquad p_3^0 = 0.3
We test the hypothesis H_0 \colon p_i = p_i^0 \,\, \text{ for all } \, i = 1, \ldots, 3 \,, \qquad H_1 \colon p_i \neq p_i^0 \, \text{ for some }\, i
We perform the goodness-of-fit test from first principles, without using chisq.test
First, we enter the data
```r
# Enter counts and null hypothesis probabilities
counts <- c(35, 40, 25)
null.p <- c(0.35, 0.35, 0.3)

# Compute expected counts E_i = m * p_i
exp.counts <- sum(counts) * null.p

# Print expected counts with a message
cat("The expected counts are:", exp.counts)

# Check if the expected counts are larger than 5
if (all(exp.counts >= 5)) {
  cat("Expected counts are larger than 5.",
      "\nThe chi-squared approximation is valid!")
} else {
  cat("Warning, low expected counts.",
      "\nMonte Carlo simulation must be used.")
}
```

The output is:

The expected counts are: 35 35 30
Expected counts are larger than 5.
The chi-squared approximation is valid!
We know that all the expected counts are larger than 5. Thus \chi^2 \approx \chi_{n-1}^2

We can therefore compute the p-value with the formula p = P( \chi_{n-1}^2 > \chi^2 ) = 1 - P( \chi_{n-1}^2 \leq \chi^2 )
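A minimal sketch of this computation, re-entering the counts and expected counts from above:

```r
# Observed and expected counts from the voting example
counts <- c(35, 40, 25)
exp.counts <- c(35, 35, 30)

# Chi-squared statistic
chi2 <- sum((counts - exp.counts)^2 / exp.counts)

# p-value p = P(chi^2_{n-1} > chi2), with n - 1 = 2 degrees of freedom
p <- 1 - pchisq(chi2, df = length(counts) - 1)
cat("The p-value is:", p)
```

This reproduces the p-value reported in the output below.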
The full code can be downloaded here good_fit_first_principles.R
Running the code gives the following output
The expected counts are: 35 35 30
Expected counts are larger than 5.
The chi-squared approximation is valid!
The p-value is: 0.4612526
Therefore p-value is p > 0.05
We do not reject H_0
Conclusion: No reason to believe that Republicans and Democrats are not tied
Comment 1: Default null probabilities

If the null probabilities are not specified, R assumes equal probability for each type
For example, assume the counts are stored as

```r
counts <- c(o1, ..., on)
```

Equal probability means that p_i^0 = \frac{1}{n} \,, \qquad \forall \, i = 1, \ldots, n

Test counts against equal probabilities with the command chisq.test(counts)
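A quick check (with illustrative counts) that the default matches explicitly uniform probabilities:

```r
counts <- c(10, 20, 30)

# Same test, with and without explicit null probabilities
t1 <- chisq.test(counts)
t2 <- chisq.test(counts, p = rep(1/3, 3))

# The two statistics are identical
c(t1$statistic, t2$statistic)
```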