Lecture 6
| | Owned | Rented | Total |
|---|---|---|---|
| North West | 2180 | 871 | 3051 |
| London | 1820 | 1400 | 3220 |
| South West | 1703 | 614 | 2317 |
| Total | 5703 | 2885 | 8588 |
Consider a two-way contingency table as above
To each person we associate two categories: Region (North West, London, or South West) and Residential Status (Owned or Rented)
| | Y = 1 | \ldots | Y = C | Totals |
|---|---|---|---|---|
| X = 1 | O_{11} | \ldots | O_{1C} | O_{1+} |
| \cdots | \cdots | \cdots | \cdots | \cdots |
| X = R | O_{R1} | \ldots | O_{RC} | O_{R+} |
| Totals | O_{+1} | \ldots | O_{+C} | m |
Consider the general two-way contingency table as above, where O_{ij} denotes the number of observations with X = i and Y = j
Asking whether the row and column classifications are associated is equivalent to asking whether the random variables X and Y are independent
Hence, testing for independence means testing the following hypotheses: \begin{align*} & H_0 \colon X \, \text{ and } \, Y \, \text{ are independent } \\ & H_1 \colon X \, \text{ and } \, Y \, \text{ are not independent } \end{align*}
We need to quantify H_0 and H_1
| | Y = 1 | \ldots | Y = C | Totals |
|---|---|---|---|---|
| X = 1 | p_{11} | \ldots | p_{1C} | p_{1+} |
| \cdots | \cdots | \cdots | \cdots | \cdots |
| X = R | p_{R1} | \ldots | p_{RC} | p_{R+} |
| Totals | p_{+1} | \ldots | p_{+C} | 1 |
The random variables X and Y are independent iff the cell probabilities factor into the marginals: p_{ij} = P(X = i , Y = j) = P(X = i) P (Y = j) = p_{i+}p_{+j}
Therefore the hypothesis test for independence becomes \begin{align*} & H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1, \ldots, C \\ & H_1 \colon p_{ij} \neq p_{i+}p_{+j} \quad \text{ for some } \, (i,j) \end{align*}
Assume the null hypothesis is true: H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1, \ldots, C
Under H_0, the expected counts become E_{ij} = m p_{ij} = m \, p_{i+}p_{+j}
We need a way to estimate the marginal probabilities p_{i+} and p_{+j}
Goal: Estimate marginal probabilities p_{i+} and p_{+j}
By definition we have p_{i+} = P( X = i )
Hence p_{i+} is the probability of an observation being classified in the i-th row
Estimate p_{i+} with the proportion of observations classified in i-th row p_{i+} := \frac{o_{i+}}{m} = \frac{ \text{Number of observations in } \, i\text{-th row} }{ \text{ Total number of observations} }
Goal: Estimate marginal probabilities p_{i+} and p_{+j}
Similarly, by definition p_{+j} = P( Y = j )
Hence p_{+j} is the probability of an observation being classified in the j-th column
Estimate p_{+j} with the proportion of observations classified in j-th column p_{+j} := \frac{o_{+j}}{m} = \frac{ \text{Number of observations in } \, j\text{-th column} }{ \text{ Total number of observations} }
Summary: The marginal probabilities are estimated with p_{i+} := \frac{o_{i+}}{m} \qquad \qquad p_{+j} := \frac{o_{+j}}{m}
Therefore the expected counts become E_{ij} = m p_{ij} = m p_{i+} p_{+j} = m \left( \frac{o_{i+}}{m} \right) \left( \frac{o_{+j}}{m} \right) = \frac{o_{i+} \, o_{+j}}{m}
By the Theorem in Slide 20, the chi-squared statistic satisfies \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \ \approx \ \chi^2_{RC - 1 - {\rm fitted}}
We need to compute the number of fitted parameters
We estimate the first R-1 row marginals by p_{i+} := \frac{o_{i+}}{m} \,, \qquad i = 1 , \ldots, R - 1
Since the marginals p_{i+} sum to 1, we can obtain p_{R+} by p_{R+} = 1 - p_{1+} - \ldots - p_{(R-1)+} = 1 - \frac{o_{1+}}{m} - \ldots - \frac{o_{(R-1)+}}{m}
Similarly, we estimate the first C-1 column marginals by p_{+j} = \frac{o_{+j}}{m} \,, \qquad j = 1, \ldots, C - 1
Since the marginals p_{+j} sum to 1, we can obtain p_{+C} by p_{+C} = 1 - p_{+1} - \ldots - p_{+(C-1)} = 1 - \frac{o_{+1}}{m} - \ldots - \frac{o_{+(C-1)}}{m}
In total, we only have to estimate (R - 1) + (C - 1 ) = R + C - 2 parameters
Therefore, the fitted parameters are {\rm fitted} = R + C - 2
Consequently, the degrees of freedom are \begin{align*} RC - 1 - {\rm fitted} & = RC - 1 - R - C + 2 \\ & = RC - R - C + 1 \\ & = (R - 1)(C - 1) \end{align*}
Assume the null hypothesis of row and column independence H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1, \ldots, C
Suppose the counts are O \sim \mathop{\mathrm{Mult}}(m,p), and expected counts are E_{ij} = \frac{o_{i+} \, o_{+j}}{m}
By the previous slides, and Theorem in Slide 20 we have that \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \approx \ \chi_{RC - 1 - {\rm fitted} }^2 = \chi_{(R-1)(C-1)}^2
The chi-squared approximation \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \approx \ \chi_{(R-1)(C-1)}^2 is good if E_{ij} \geq 5 \quad \text{ for all } \,\, i = 1 , \ldots , R \, \text{ and } j = 1, \ldots, C
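One way to check this condition in practice is via the expected counts returned by chisq.test; a minimal sketch, assuming the observed table is stored in a matrix called counts (as in the R code later in this lecture):
# Expected counts under independence, E_ij = o_i+ o_+j / m
E <- chisq.test(counts)$expected
# The chi-squared approximation is considered reliable if all expected counts are at least 5
all(E >= 5)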
Sample:
We draw m individuals from population
Each individual can be of type (i,j), where
O_{ij} denotes the number of items of type (i,j) drawn
p_{ij} is probability of observing type (i,j)
p = (p_{ij})_{ij} is probability matrix
The matrix of counts O = (o_{ij})_{ij} has multinomial distribution \mathop{\mathrm{Mult}}(m,p)
Data: Matrix of counts, represented by a two-way contingency table
| | Y = 1 | \ldots | Y = C | Totals |
|---|---|---|---|---|
| X = 1 | o_{11} | \ldots | o_{1C} | o_{1+} |
| \cdots | \cdots | \cdots | \cdots | \cdots |
| X = R | o_{R1} | \ldots | o_{RC} | o_{R+} |
| Totals | o_{+1} | \ldots | o_{+C} | m |
Hypothesis test: We test for independence of rows and columns \begin{align*} & H_0 \colon X \, \text{ and } \, Y \, \text{ are independent } \\ & H_1 \colon X \, \text{ and } \, Y \, \text{ are not independent } \end{align*}
Compute marginal and total counts o_{i+} := \sum_{j=1}^C o_{ij} \,, \qquad o_{+j} := \sum_{i=1}^R o_{ij} \,, \qquad m = \sum_{i=1}^R o_{i+} = \sum_{j=1}^C o_{+j}
Compute the expected counts E_{ij} = \frac{o_{i+} \, o_{+j} }{ m }
Compute the chi-squared statistic \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (o_{ij} - E_{ij})^2 }{E_{ij}}
| Alternative | Rejection Region | p-value |
|---|---|---|
| X and Y not independent | \chi^2 > \chi^2_{\rm{df}}(0.05) | P(\chi_{\rm df}^2 > \chi^2) |
where {\rm df} = (R-1)(C-1)
To perform the test in R:
- Store each row into an R vector: row_i <- c(o_i1, o_i2, ..., o_iC)
- Assemble the rows into a matrix: counts <- rbind(row_1, ..., row_R)
- Pass counts to chisq.test

| Alternative | R command |
|---|---|
| X and Y are not independent | chisq.test(counts) |
Note: Compute Monte Carlo p-value with option simulate.p.value = TRUE
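A minimal sketch of both commands, assuming counts is the matrix of observed counts assembled as above:
# Asymptotic chi-squared test of independence
chisq.test(counts)
# Monte Carlo p-value, based on B simulated tables (B = 2000 is the default)
chisq.test(counts, simulate.p.value = TRUE, B = 2000)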
| | Owned | Rented | Total |
|---|---|---|---|
| North West | 2180 | 871 | 3051 |
| London | 1820 | 1400 | 3220 |
| South West | 1703 | 614 | 2317 |
| Total | 5703 | 2885 | 8588 |
People are sampled at random in NW, London and SW
To each person we associate two categories: Region (NW, London, or SW) and Residential Status (Owned or Rented)
Question: Are Residential Status and Region independent?
| | Owned | Rented | Tot |
|---|---|---|---|
| NW | 2180 | 871 | o_{1+} = 3051 |
| Lon | 1820 | 1400 | o_{2+} = 3220 |
| SW | 1703 | 614 | o_{3+} = 2317 |
| Tot | o_{+1} = 5703 | o_{+2} = 2885 | m = 8588 |
| | Owned | Rented | Tot |
|---|---|---|---|
| NW | \frac{3051{\times}5703}{8588} \approx 2026 | \frac{3051{\times}2885}{8588} \approx 1025 | o_{1+} = 3051 |
| Lon | \frac{3220{\times}5703}{8588} \approx 2138 | \frac{3220{\times}2885}{8588} \approx 1082 | o_{2+} = 3220 |
| SW | \frac{2317{\times}5703}{8588} \approx 1539 | \frac{2317{\times}2885}{8588} \approx 778 | o_{3+} = 2317 |
| Tot | o_{+1} = 5703 | o_{+2} = 2885 | m = 8588 |
Observed row percentages of Residential Status within each Region:

| | Owned | Rented |
|---|---|---|
| North West | 71.5\% | 28.5\% |
| London | 56.5\% | 43.5\% |
| South West | 73.5\% | 26.5\% |
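The output below can be reproduced with a short R script along the following lines (counts taken from the table above):
# Store each region as a row: columns are Owned, Rented
row_1 <- c(2180, 871)   # North West
row_2 <- c(1820, 1400)  # London
row_3 <- c(1703, 614)   # South West
counts <- rbind(row_1, row_2, row_3)
# Chi-squared test of independence between Region and Residential Status
print(chisq.test(counts))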
Pearson's Chi-squared test
data: counts
X-squared = 228.11, df = 2, p-value < 2.2e-16
| Manager | Won | Drawn | Lost |
|---|---|---|---|
| Moyes | 27 | 9 | 15 |
| Van Gaal | 54 | 25 | 24 |
| Mourinho | 84 | 32 | 28 |
| Solskjaer | 91 | 37 | 40 |
| Rangnick | 11 | 10 | 8 |
| ten Hag | 61 | 12 | 28 |
Table contains Man Utd games since 2014 season (updated to 2024)
To each Man Utd game in the sample we associate two categories: Manager and Team Performance (Won, Drawn, or Lost)
Question: Is there association between Manager and Team Performance?
To answer the question, we perform chi-squared test of independence in R
Test can be performed by suitably modifying the code independence_test.R
# Store each row into an R vector
row_1 <- c(27, 9, 15)
row_2 <- c(54, 25, 24)
row_3 <- c(84, 32, 28)
row_4 <- c(91, 37, 40)
row_5 <- c(11, 10, 8)
row_6 <- c(61, 12, 28)
# Assemble the rows into an R matrix
counts <- rbind(row_1, row_2, row_3, row_4, row_5, row_6)
# Perform chi-squared test of independence
ans <- chisq.test(counts)
# Print answer
print(ans)
Pearson's Chi-squared test
data: counts
X-squared = 12.678, df = 10, p-value = 0.2422
Vectors can contain only one data type (number, character, boolean)
Lists are data structures that can contain any R object
Lists can be created similarly to vectors, with the command list()
Elements of a list can be retrieved by indexing
- my_list[[k]] returns the k-th element of my_list
You can return multiple items of a list via slicing:
- my_list[c(k1, ..., kn)] returns elements in positions k1, ..., kn
- my_list[k1:k2] returns elements k1 to k2
Elements of a list can be named with names(my_list) <- c("name_1", ..., "name_k")

# Create list with 3 elements
my_list <- list(2, c(T,F,T,T), "hello")
# Name each of the 3 elements
names(my_list) <- c("number", "TF_vector", "string")
# Print the named list: the list is printed along with element names
print(my_list)

$number
[1] 2
$TF_vector
[1] TRUE FALSE TRUE TRUE
$string
[1] "hello"
The element of my_list named my_name can be accessed with the dollar operator: my_list$my_name

# Create list with 3 elements and name them
my_list <- list(2, c(T,F,T,T), "hello")
names(my_list) <- c("number", "TF_vector", "string")
# Access 2nd element using dollar operator and store it in variable
second_component <- my_list$TF_vector
# Print 2nd element
print(second_component)

[1] TRUE FALSE TRUE TRUE
Vector elements can be named as well:
- names(my_vector) <- c("name_1", ..., "name_k") assigns names to the elements
- my_vector["name_k"] returns the element named "name_k"
Accessing a named element prints the name above the value, for example:
Green
3
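A short illustration with a hypothetical named vector:
# Create a vector and name its elements
my_vector <- c(1, 3, 7)
names(my_vector) <- c("Red", "Green", "Blue")
# Access an element by name: the name is printed above the value
print(my_vector["Green"])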
Data Frames are the preferred way of presenting a data set in R:
Data frames can contain any R object
Data Frames are similar to Lists, with the difference that each element must be a vector, and all vectors must have the same length
In simpler terms: each column of a data frame is a variable, and each row is an observation
Data frames are constructed similarly to lists, using data.frame()
Important: Elements of a data frame must be vectors of the same length
Example: We construct the Family Guy data frame (a construction sketch is given below). The variables are
- person – Name of character
- age – Age of character
- sex – Sex of character
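A minimal construction sketch; the values are taken from the printouts appearing later in these notes:
# Each variable is a vector; all vectors must have the same length
family <- data.frame(person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
                     age    = c(42, 40, 17, 14, 1),
                     sex    = c("M", "F", "F", "M", "M"))
print(family)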
Think of a data frame as a matrix
You can access the element in position (m,n) by using my_data[m, n]
Example: the element of family in position (1,1) is
[1] "Peter"
To access multiple elements on the same row or column, type
- my_data[c(k1,...,kn), m] \quad or \quad my_data[k1:k2, m] (elements of column m)
- my_data[n, c(k1,...,km)] \quad or \quad my_data[n, k1:k2] (elements of row n)
Example: We want to access Stewie's Sex and Age:
age sex
5 1 M
To access the rows in positions k1,...,kn, type my_data[c(k1,...,kn), ]
To access the rows in positions k1 to k2, type my_data[k1:k2, ]
Example: We access the rows of Peter and Meg:
 person age sex
1 Peter 42 M
3 Meg 17 F
To access the columns in positions k1,...,kn, type my_data[ , c(k1,...,kn)]
To access the columns in positions k1 to k2, type my_data[ , k1:k2]
Example: We access the age and person columns (in this order):
 age person
1 42 Peter
2 40 Lois
3 17 Meg
4 14 Chris
5 1 Stewie
Use the dollar operator to access data frame columns
Suppose my_data contains a variable called my_variable:
- my_data$my_variable accesses the column my_variable
- my_data$my_variable is a vector
Example: To access age in the family data frame, type the commands sketched below
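A minimal sketch of commands that would produce the output below (the exact code is not shown in the notes):
# The column age is a vector, accessed with the dollar operator
cat("Ages of the Family Guy characters are:", family$age, "\n")
# Individual entries can be indexed: Meg is the 3rd character
cat("Meg's age is", family$age[3], "\n")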
Ages of the Family Guy characters are: 42 40 17 14 1
Meg's age is 17
| Command | Output |
|---|---|
| nrow(my_data) | # of rows |
| ncol(my_data) | # of columns |
| dim(my_data) | vector containing # of rows and columns |
family_dim <- dim(family) # Stores dimensions of family in a vector
cat("The Family Guy data frame has",
family_dim[1],
"rows and",
family_dim[2],
"columns")The Family Guy data frame has 5 rows and 3 columns
Assume we are given a data frame my_data

Adding a record (with rbind):
- Create a new record new_record
- new_record must match the structure of my_data
- Add the record to my_data with my_data <- rbind(my_data, new_record)

Adding a variable (with cbind):
- Create a new variable new_variable
- new_variable must have as many components as rows in my_data
- Add the variable to my_data with my_data <- cbind(my_data, new_variable)

Example: We add a new record for Brian to family. The new record is

 person age sex
1 Brian 7 M
Important: Names have to match existing data frame
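A minimal sketch of this step; the construction of new_record is not shown in the notes, so the exact call below is an assumption:
# Create the new record as a one-row data frame with matching column names
new_record <- data.frame(person = "Brian", age = 7, sex = "M")
# Append the record to family
family <- rbind(family, new_record)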
Adding new_record to family gives
 person age sex
1 Peter 42 M
2 Lois 40 F
3 Meg 17 F
4 Chris 14 M
5 Stewie 1 M
6 Brian 7 M
Example: We now add a new variable funny to family
- Create a vector funny with entries matching each character (including Brian)
- Add funny to the Family Guy data frame family (a sketch of these commands is given after the printout below)

The result is

 person age sex funny
1 Peter 42 M High
2 Lois 40 F High
3 Meg 17 F Low
4 Chris 14 M Med
5 Stewie 1 M High
6 Brian 7 M Med
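A minimal sketch of the two commands above; the exact code is not shown in the notes, and the funny ratings are read off the printout:
# One "funny" rating per character (including Brian)
funny <- c("High", "High", "Low", "Med", "High", "Med")
# Append the new column to family
family <- cbind(family, funny)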
Instead of using cbind we can use the dollar operator:
- Suppose we want to add a variable new_variable to my_data
- Create a vector v containing the values for the new variable
- v must have as many components as rows in my_data
- Add the new variable with my_data$new_variable <- v

Example: We add to family the variable age.months, obtained by multiplying family$age by 12

v <- family$age * 12 # Computes vector of ages in months
family$age.months <- v # Stores vector as new column in family
print(family)

 person age sex funny age.months
1 Peter 42 M High 504
2 Lois 40 F High 480
3 Meg 17 F Low 204
4 Chris 14 M Med 168
5 Stewie 1 M High 12
6 Brian 7 M Med 84
We saw how to use logical flag vectors to subset vectors
We can use logical flag vectors to subset data frames as well
Suppose we have a data frame my_data containing a variable my_variable
Want to subset records in my_data for which my_variable satisfies a condition
Use the commands:
- flag <- condition(my_data$my_variable) creates a logical flag vector
- my_data[flag, ] returns the rows of my_data for which flag is TRUE

Example: We subset the rows of family corresponding to female characters

flag <- family$sex == "F" # Logical flag vector
print(flag)

[1] FALSE TRUE TRUE FALSE FALSE FALSE
# Subset data frame "family" and store in data frame "subset"
subset <- family[flag, ]
print(subset)

 person age sex funny age.months
2 Lois 40 F High 480
3 Meg 17 F Low 204
R has many functions for reading data from stored files
We will see how to read Table-Format files
Table-Formats are just tables stored in plain-text files
Typical file extensions are:
- .txt for plain-text files
- .csv for comma-separated values
Table-Formats can be read into R with the command read.table():
- Reads .txt or .csv files
- Outputs a data frame
Options of read.table():
- header = T/F specifies whether the first line of the file contains the variable names
- na.strings = "string" specifies that "string" means NA (missing values)

To read family_guy.txt into R proceed as follows:
Download family_guy.txt and move file to Desktop
Open the R Console and change working directory to Desktop
Read family_guy.txt into R and store it in the data frame family, using the option of read.table() that says family_guy.txt has a header (a sketch of this command is given below)
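A minimal sketch of the read command, assuming family_guy.txt sits in the current working directory:
# Read the table-format file into a data frame; header = TRUE says the first line contains variable names
# (depending on how missing values are coded in the file, the na.strings option may also be needed)
family <- read.table("family_guy.txt", header = TRUE)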
Print family to screen:

 person age sex funny age.mon
1 Peter NA M High 504
2 Lois 40 F <NA> 480
3 Meg 17 F Low 204
4 Chris 14 M Med 168
5 Stewie 1 M High NA
6 Brian NA M Med NA
Example: t-test on data read from a .txt file
Data: Analysis of the Consumer Confidence Index (CCI) for the 2008 crisis (seen in Lecture 3)
In Lecture 3 the data was entered manually with c(); this time we read it from a .txt file instead
Goal: Perform a t-test on the CCI difference for mean difference \mu = 0
We read the data with read.table(). The CCI dataset can be downloaded here: 2008_crisis.txt
The text file contains a table of CCI scores, with the 2007 scores in the second column and the 2009 scores in the third column
To perform the t-test on data 2008_crisis.txt we proceed as follows:
Download dataset 2008_crisis.txt and move file to Desktop
Open the R Console and change working directory to Desktop
Read 2008_crisis.txt into R with read.table() (as done for family_guy.txt) and store it in the data frame scores, then split scores into 2 vectors:

# CCI from 2007 is stored in 2nd column
score_2007 <- scores[, 2]
# CCI from 2009 is stored in 3rd column
score_2009 <- scores[, 3]

# Compute vector of differences
difference <- score_2007 - score_2009
# Perform t-test on difference with null hypothesis mu = 0
t.test(difference, mu = 0)$p.value

[1] 4.860686e-13
10 patients treated with both Drug A and Drug B
| i | x_i | y_i |
|---|---|---|
| 1 | 1.9 | 0.7 |
| 2 | 0.8 | -1.0 |
| 3 | 1.1 | -0.2 |
| 4 | 0.1 | -1.2 |
| 5 | -0.1 | -0.1 |
| 6 | 4.4 | 3.4 |
| 7 | 4.6 | 0.0 |
| 8 | 1.6 | 0.8 |
| 9 | 5.5 | 3.7 |
| 10 | 3.4 | 2.0 |
Goal: Determine whether there is a relation between the reaction x_i to Drug A and the reaction y_i to Drug B
Plot: Scatter plot of the data points (x_i, y_i)

Linear relation:
Try to fit a line through the data
Line roughly predicts y_i from x_i
However note the outlier (x_7,y_7) = (4.6, 0) (red point)
How is such a line constructed?

Least Squares Line:

Least Squares Line:
We would like predicted and actual value to be close \hat{y}_i \approx y_i
Hence the vertical difference has to be small y_i - \hat{y}_i \approx 0

Least Squares Line:
We want \hat{y}_i - y_i \approx 0 \,, \qquad \forall \, i
This can be achieved by minimizing the sum of squares \min_{\alpha, \beta} \ \sum_{i} \ (y_i - \hat{y}_i)^2 \,, \qquad \text{where} \quad \hat{y}_i = \beta x_i + \alpha
The quantity \mathop{\mathrm{RSS}}:= \sum_{i} (y_i - \hat{y}_i)^2 = \sum_{i} (y_i - \alpha - \beta x_i)^2 is called the residual sum of squares

Note: \mathop{\mathrm{RSS}} can be seen as a function \mathop{\mathrm{RSS}}\colon \mathbb{R}^2 \to \mathbb{R}\qquad \quad \mathop{\mathrm{RSS}}= \mathop{\mathrm{RSS}}(\alpha,\beta)
For a given sample (x_1,y_1), \ldots, (x_n, y_n), define
Sample Means: \overline{x} := \frac{1}{n} \sum_{i=1}^n x_i \qquad \quad \overline{y} := \frac{1}{n} \sum_{i=1}^n y_i
Sums of squares: S_{xx} := \sum_{i=1}^n ( x_i - \overline{x} )^2 \qquad \quad S_{yy} := \sum_{i=1}^n ( y_i - \overline{y} )^2
Sum of cross-products: S_{xy} := \sum_{i=1}^n ( x_i - \overline{x} ) ( y_i - \overline{y} )
Given (x_1,y_1), \ldots, (x_n, y_n), consider the minimization problem \begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation} Then (M) admits a minimizer (\hat\alpha, \hat\beta) given by \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} }
To prove the Theorem we need some background results
A symmetric matrix is positive semi-definite if all the eigenvalues \lambda_i satisfy \lambda_i \geq 0
Proposition: A 2 \times 2 symmetric matrix M is positive semi-definite iff \det M \geq 0 \,, \qquad \quad \operatorname{Tr}(M) \geq 0
Consider a twice differentiable function f \colon \mathbb{R}^2 \to \mathbb{R}, \quad f = f (x,y), with Hessian \nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \\ \end{array} \right)
By the Proposition, \nabla^2 f is positive semi-definite iff \det \nabla^2 f = f_{xx} f_{yy} - f_{xy}^2 \geq 0 \qquad \text{and} \qquad f_{xx} + f_{yy} \geq 0
Recall also that \nabla^2 f \, \text{ is positive semi-definite} \qquad \iff \qquad f \, \text{ is convex}
Suppose f \colon \mathbb{R}^2 \to \mathbb{R} has positive semi-definite Hessian. Then the following are equivalent:
The point (\hat{x},\hat{y}) is a minimizer of f, that is, f(\hat{x}, \hat{y}) = \min_{x,y} \ f(x,y)
The point (\hat{x},\hat{y}) satisfies the optimality conditions \nabla f (\hat{x},\hat{y}) = 0
Note: The proof of the above Lemma can be found in [1]
Example: Consider the function f(x,y) = x^2 + y^2
It is clear that \min_{x,y} \ f(x,y) = \min_{x,y} \ x^2 + y^2 = 0 \,, with the only minimizer being (0,0)
However, let us use the Lemma to prove this fact
The gradient is \nabla f = (f_x,f_y) = (2x, 2y)
Hence the optimality conditions give \nabla f = 0 \qquad \iff \qquad x = y = 0
The Hessian is \nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{array} \right) = \left( \begin{array}{cc} 2 & 0 \\ 0 & 2 \end{array} \right)
so that \det (\nabla^2 f) = 4 > 0 \qquad \qquad \operatorname{Tr}(\nabla^2 f) = 4 > 0
Therefore \nabla^2 f is positive semi-definite and, by the Lemma, (0,0) is a minimizer: 0 = f(0,0) = \min_{x,y} \ f(x,y)
We go back to proving the RSS Minimization Theorem
Suppose given data points (x_1,y_1), \ldots, (x_n, y_n)
We want to solve the minimization problem
\begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation}
To apply the Lemma, we need to compute \nabla \mathop{\mathrm{RSS}}\quad \text{ and } \quad \nabla^2 \mathop{\mathrm{RSS}}
We first compute \nabla \mathop{\mathrm{RSS}} and solve the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0
To this end, recall that \overline{x} := \frac{\sum_{i=1}^nx_i}{n} \qquad \implies \qquad \sum_{i=1}^n x_i = n \overline{x}
Similarly, we have \sum_{i=1}^n y_i = n \overline{y}
\begin{align*} \mathop{\mathrm{RSS}}_{\alpha} & = -2\sum_{i=1}^n(y_i- \alpha- \beta x_i) \\[10pt] & = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \\[20pt] \mathop{\mathrm{RSS}}_{\beta} & = -2\sum_{i=1}^n x_i (y_i- \alpha - \beta x_i) \\[10pt] & = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 \end{align*}
The optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0 therefore read \begin{align} - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} & = 0 \tag{1} \\[20pt] - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 & = 0 \tag{2} \end{align}
Equation (1) reads -2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} = 0
which is equivalent to \alpha = \overline{y}- \beta \overline{x}
Dividing Equation (2) by 2 gives \sum_{i=1}^n x_i y_i - \alpha n \overline{x} - \beta \sum_{i=1}^n x^2_i = 0
Substituting \alpha = \overline{y}- \beta \overline{x} and using the identities S_{xy} = \sum_{i=1}^n x_i y_i - n \overline{x} \, \overline{y} and S_{xx} = \sum_{i=1}^n x_i^2 - n \overline{x}^2, we see that Equation (2) is equivalent to \beta = \frac{S_{xy}}{ S_{xx} }
Also recall that Equation (1) is equivalent to \alpha = \overline{y}- \beta \overline{x}
Therefore (\hat\alpha, \hat\beta) solves the optimality conditions \nabla \mathop{\mathrm{RSS}}= 0 iff \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} }
We need to compute \nabla^2 \mathop{\mathrm{RSS}}
To this end recall that \mathop{\mathrm{RSS}}_{\alpha} = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \,, \quad \mathop{\mathrm{RSS}}_{\beta} = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2
Therefore we have \begin{align*} \mathop{\mathrm{RSS}}_{\alpha \alpha} & = 2n \qquad & \mathop{\mathrm{RSS}}_{\alpha \beta} & = 2 n \overline{x} \\ \mathop{\mathrm{RSS}}_{\beta \alpha } & = 2 n \overline{x} \qquad & \mathop{\mathrm{RSS}}_{\beta \beta} & = 2 \sum_{i=1}^{n} x_i^2 \end{align*}
\begin{align*} \det (\nabla^2 \mathop{\mathrm{RSS}}) & = \mathop{\mathrm{RSS}}_{\alpha \alpha}\mathop{\mathrm{RSS}}_{\beta \beta} - \mathop{\mathrm{RSS}}_{\alpha \beta}^2 \\[10pt] & = 4n \sum_{i=1}^{n} x_i^2 - 4 n^2 \overline{x}^2 \\[10pt] & = 4n \left( \sum_{i=1}^{n} x_i^2 - n \overline{x}^2 \right) \\[10pt] & = 4n S_{xx} \end{align*}
Since S_{xx} = \sum_{i=1}^n (x_i - \overline{x})^2 \geq 0, it follows that \det (\nabla^2 \mathop{\mathrm{RSS}}) = 4n S_{xx} \geq 0
Moreover, \operatorname{Tr}(\nabla^2\mathop{\mathrm{RSS}}) = \mathop{\mathrm{RSS}}_{\alpha \alpha} + \mathop{\mathrm{RSS}}_{\beta \beta} = 2n + 2 \sum_{i=1}^{n} x_i^2 \geq 0
Therefore we have proven \det( \nabla^2 \mathop{\mathrm{RSS}}) \geq 0 \,, \qquad \quad \operatorname{Tr}(\nabla^2\mathop{\mathrm{RSS}}) \geq 0
As the Hessian is symmetric, we conclude that \nabla^2 \mathop{\mathrm{RSS}} is positive semi-definite
By the Lemma, we have that all the solutions (\alpha,\beta) to the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0 are minimizers
Therefore (\hat \alpha,\hat\beta) with \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} } is a minimizer of \mathop{\mathrm{RSS}}, ending the proof
The previous Theorem motivates the following definition
Definition: The least squares line associated to the data (x_1,y_1), \ldots, (x_n,y_n) is the line \hat{y} = \hat{\alpha} + \hat{\beta} x \,, \qquad \text{ where } \quad \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \hat\beta = \frac{S_{xy}}{ S_{xx} }
In R, we want to do the following:
Input the data into vectors x and y
Compute the least-square line coefficients \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x} \qquad \qquad \hat{\beta} = \frac{S_{xy}}{S_{xx}}
Plot the data points (x_i,y_i)
Overlay the least squares line
| i | x_i | y_i |
|---|---|---|
| 1 | 1.9 | 0.7 |
| 2 | 0.8 | -1.0 |
| 3 | 1.1 | -0.2 |
| 4 | 0.1 | -1.2 |
| 5 | -0.1 | -0.1 |
| 6 | 4.4 | 3.4 |
| 7 | 4.6 | 0.0 |
| 8 | 1.6 | 0.8 |
| 9 | 5.5 | 3.7 |
| 10 | 3.4 | 2.0 |
We give a first solution using elementary R functions
The code to input the data into a data-frame is as follows
# Input blood pressure changes data into data-frame
changes <- data.frame(drug_A = c(1.9, 0.8, 1.1, 0.1, -0.1,
4.4, 4.6, 1.6, 5.5, 3.4),
drug_B = c(0.7, -1.0, -0.2, -1.2, -0.1,
3.4, 0.0, 0.8, 3.7, 2.0)
             )

We now assign the columns drug_A and drug_B of changes to vectors x and y and compute the quantities needed for the least-squares coefficients; a sketch of this step is given below
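A sketch of this intermediate step (not shown explicitly in the notes), assuming the names xbar, ybar, S_xx and S_xy used in the next chunk:
# Assign the data-frame columns to vectors
x <- changes$drug_A
y <- changes$drug_B
# Sample means
xbar <- mean(x)
ybar <- mean(y)
# Sums of squares and cross-products
S_xx <- sum((x - xbar)^2)
S_xy <- sum((x - xbar) * (y - ybar))
# Compute least-square line coefficients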
beta <- S_xy / S_xx
alpha <- ybar - beta * xbar
# Print coefficients
cat("\nCoefficient alpha =", alpha)
cat("\nCoefficient beta =", beta)
Coefficient alpha = -0.7861478
Coefficient beta = 0.685042
# Plot the data
plot(x, y, xlab = "", ylab = "", pch = 16, cex = 2)
# Add labels
mtext("Drug A reaction x_i", side = 1, line = 3, cex = 2.1)
mtext("Drug B reaction y_i", side = 2, line = 2.5, cex = 2.1)

Note: We have added a few cosmetic options
- pch = 16 plots points as filled black circles
- cex = 2 stands for character expansion – specifies the size of the points
mtext is used to fine-tune the axis labels:
- side = 1 stands for x-axis
- side = 2 stands for y-axis
- line specifies the distance of the label from the axis
The least squares line can be overlaid with abline(a, b), where a is the intercept and b is the slope:
- col specifies the color of the plot
- lwd specifies the line width
The code can be downloaded here: least_squares_1.R
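For instance, the least squares line computed above could be overlaid with a call along these lines (the color and width values are illustrative):
# Overlay the least-squares line: intercept alpha, slope beta
abline(alpha, beta, col = "red", lwd = 3)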
Running the code we obtain the plot below
R has a built-in command called lm
lm stands for linear model
We can use lm to fit the least-squares line
The basic syntax is lm(formula, data):
- data expects a data-frame in input
- formula stands for the relation to fit
- For the least-squares line, formula = y ~ x, where x and y are the names of two variables in the data-frame
- y ~ x can be read as: y modelled as a function of x
Note: Storing data in a data-frame is optional
We can just store data in vectors x and y
Then fit the least-squares line with lm(y ~ x)
The command to fit the least-squares line to the data-frame changes is

# Fit least squares line
least_squares <- lm(formula = drug_B ~ drug_A, data = changes)
# Print output
print(least_squares)
Call:
lm(formula = drug_B ~ drug_A, data = changes)
Coefficients:
(Intercept) drug_A
-0.7861 0.6850
- The above tells us that the estimators are
\hat \alpha = -0.7861 \,, \qquad \quad \hat \beta = 0.6850
The data can be plotted as before, using changes$drug_A and changes$drug_B as the x and y vectors
The least-squares line is currently stored in least_squares
To add the line to the current plot use abline
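A minimal sketch of this step; abline accepts a fitted lm object directly:
# Add the fitted least-squares line to the current scatter plot
abline(least_squares, col = "red", lwd = 3)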
The code can be downloaded here least_squares_2.R
Running the code we obtain the plot below