
Lecture 10
In Lecture 7 we introduced the general linear regression model
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
Z_1 \, , \,\, \ldots \, , \, Z_p
Y | Z_1 = z_{i1} \,, \,\, \ldots \,, \,\, Z_p = z_{ip}
Predictor is known: The values z_{i1}, \ldots, z_{ip} are known
Normality: The distribution of Y_i is normal
Linear mean: There are parameters \beta_1,\ldots,\beta_p such that {\rm I\kern-.3em E}[Y_i] = \beta_1 z_{i1} + \ldots + \beta_p z_{ip}
Homoscedasticity: There is a parameter \sigma^2 such that {\rm Var}[Y_i] = \sigma^2
Independence: The random variables Y_1 , \ldots , Y_n are independent, and thus uncorrelated
{\rm Cor}(Y_i,Y_j) = 0 \qquad \forall \,\, i \neq j
Predictor is known: The values z_{i1}, \ldots, z_{ip} are known
Normality: The distribution of \varepsilon_i is normal
Linear mean: The errors have zero mean {\rm I\kern-.3em E}[\varepsilon_i] = 0
Homoscedasticity: There is a parameter \sigma^2 such that {\rm Var}[\varepsilon_i] = \sigma^2
Independence: Errors \varepsilon_1 , \ldots , \varepsilon_n are independent, and thus uncorrelated
{\rm Cor}(\varepsilon_i, \varepsilon_j) = 0 \qquad \forall \,\, i \neq j
Z^T Z \, \text{ is invertible}
\beta = (\beta_1, \ldots, \beta_p)
\hat \beta = (Z^T Z)^{-1} Z^T y
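The formula for \hat\beta can be evaluated directly with matrix operations. Below is a minimal sketch with made-up data (the design matrix and response are illustrative only), cross-checked against lm():

```r
# Made-up design matrix Z (column of ones for the intercept) and response y
Z <- cbind(1, c(1, 2, 3, 4, 5))
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Least squares estimate: beta_hat = (Z^T Z)^{-1} Z^T y
beta.hat <- solve(t(Z) %*% Z) %*% t(Z) %*% y

# Cross-check against R's built-in lm()
fit <- lm(y ~ Z[, 2])
print(cbind(beta.hat, coef(fit)))
```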
{\rm Var}[\varepsilon_i] \neq {\rm Var}[\varepsilon_j] \qquad \text{ for some } \,\, i \neq j
{\rm Cor}( \varepsilon_i, \varepsilon_j ) \neq 0 \qquad \text{ for some } \,\, i \neq j
Z^T Z
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
{\rm Cor}(\varepsilon_i , \varepsilon_j) = 0 \qquad \forall \,\, i \neq j
Recall the methods to assess linear models
The above methods rely heavily on independence
Once again, let us consider the likelihood calculation \begin{align*} L & = f(y_1, \ldots, y_n) = \prod_{i=1}^n f_{Y_i} (y_i) \\[15pts] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{\sum_{i=1}^n(y_i- \hat y_i)^2}{2\sigma^2} \right) \\[15pts] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{ \mathop{\mathrm{RSS}}}{2\sigma^2} \right) \end{align*}
The second equality is only possible thanks to independence of
Y_1 , \ldots, Y_n
{\rm Cor}(\varepsilon_i,\varepsilon_j) \neq 0 \quad \text{ for some } \, i \neq j
\varepsilon_i \, \text{ and } \, \varepsilon_j \, \text{ dependent } \quad \implies \quad Y_i \, \text{ and } \, Y_j \, \text{ dependent }
L \neq \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{ \mathop{\mathrm{RSS}}}{2\sigma^2} \right)
In this case \hat \beta no longer maximizes the likelihood!
As already seen, this implies that
\mathop{\mathrm{e.s.e.}}(\beta_j) \,\, \text{ is unreliable}
{\rm Cor}(\varepsilon_i,\varepsilon_j) \neq 0 \quad \text{ for some } \, i \neq j
Autocorrelation is often unavoidable
Typically associated with time series data
Economic time series tend to exhibit cyclical behaviour
Examples: GNP, price indices, production figures, employment statistics etc.
These series tend to be quite slow moving
This is an extremely common phenomenon in financial and economic time series
Characteristic of industries in which a large amount of time passes between a production decision and its realisation
Cobweb phenomenon is common with agricultural commodities
Economic agents (e.g. farmers) decide
\begin{equation} \tag{3} {\rm Supply}_t = \beta_1 + \beta_2 \, {\rm Price}_{t-1} + \varepsilon_t \end{equation}
Errors \varepsilon_t in equation (3) are unlikely to be completely random and patternless
This is because
Error terms are likely to be autocorrelated
Examples:
Quarterly data may smooth out the wild fluctuations in monthly sales figures
Low frequency economic survey data may be interpolated
However: Such data transformations may be inevitable
In social sciences data quality may be variable
This may induce systematic patterns and autocorrelation
No magic solution – Autocorrelation is unavoidable and must be considered
Autocorrelation is often unavoidable. We should be able to detect it:
Graphical and statistical methods can be a useful cross-check of each other!
Check to see if any evidence of a systematic pattern exists:
Code for this example is available here autocorrelation.R
Stock Vs Gold prices data is available here gold_stock.txt
Read data into R and fit simple regression
Time-series plot of residuals: Plot the residuals \hat{\varepsilon}_i
Autocorrelation plot of residuals:

The series contains n = 33 observations
Autocorrelation plot of residuals: Plot \hat{\varepsilon}_t against \hat{\varepsilon}_{t-1}
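The gold/stock data file is not reproduced here, so the sketch below uses simulated data with AR(1) errors (all numbers are made up) to produce the three diagnostic plots just described:

```r
set.seed(1)
n <- 100
x <- seq(1, 10, length.out = n)

# Simulate AR(1) errors and a linear relationship
eps <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
y <- 2 + 0.5 * x + eps

# Fit ordinary regression and extract residuals
fit <- lm(y ~ x)
res <- resid(fit)

# Time-series plot of residuals
plot(res, type = "l")

# Autocorrelation plot of residuals
acf(res)

# Lag plot: residual at time t against residual at time t - 1
plot(res[-n], res[-1])
```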
Conclusions: The dataset Gold Price vs Stock Price exhibits Autocorrelation
We do not cover these tests
Y_i = \alpha + \beta x_i + \varepsilon_i
{\rm Cor}(\hat{\varepsilon}_i, \hat{\varepsilon}_j ) \neq 0 \quad \text{ for some } \, i \neq j
\hat{\varepsilon}_t = a + b \, \hat{\varepsilon}_{t-1} + u_t \,, \qquad u_t \,\, \text{ iid } \,\, N(0,\sigma^2)
Y_i = \alpha + \beta x_i + \varepsilon_i
Instead, substitute into the model the relation \hat{\varepsilon}_t = a + b \, \hat{\varepsilon}_{t-1} + u_t
We obtain the new linear model (up to relabelling coefficients)
Y_t = \alpha + \beta x_t + \phi \hat{\varepsilon}_{t-1} + u_t \,, \qquad u_t \,\, \text{ iid } \,\, N(0,\sigma^2)
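A sketch of fitting this lag-augmented model with lm(), again on simulated data (variable names and numbers are made up):

```r
set.seed(42)
n <- 100
x <- rnorm(n)
eps <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
y <- 1 + 2 * x + eps

# Step 1: ordinary fit, store the residuals
fit0 <- lm(y ~ x)
ehat <- resid(fit0)

# Step 2: refit with the lagged residual as an extra regressor
#         (the first observation has no lag, so it is dropped)
lag.ehat <- ehat[-n]           # ehat_{t-1}
fit1 <- lm(y[-1] ~ x[-1] + lag.ehat)
coef(fit1)
```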
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
However, suppose the residuals exhibit autocorrelation
In this case, ordinary least squares assumptions are violated
In this case, instead of regression, we need to use ARIMA models:
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
y is the given data vector; z1, ..., zn the given predictors
xreg specifies the regressors included in the model
order specifies the ARIMA(p, d, q) component
order = c(1, 0, 0) specifies an AR(1) component for the errors
Code for this example is available here arima.R
Stock Vs Gold prices data is available here gold_stock.txt
We want to fit the regression model
\texttt{gold.price} = \alpha + \beta \times (\texttt{stock.price}) + \varepsilon
We already saw that residuals are autocorrelated
Therefore we fit an ARIMA model of order 1 – abbreviated AR(1)
Output:
Coefficients:
ar1 intercept stock.price
0.5578 3.9406 -0.0487
s.e. 0.1545 0.5675 0.0247
\begin{align*} (\texttt{gold.price})_t & = \alpha + \beta \times (\texttt{stock.price})_t + \varepsilon_t \\[1.0em] \varepsilon_t & = \phi \varepsilon_{t-1} + u_t \,, \qquad u_t \,\, \text{ iid } \,\, N(0,\sigma^2) \end{align*}
| Variable | Parameter | Value | \mathop{\mathrm{e.s.e.}} |
|---|---|---|---|
| ar1 | \phi | 0.5578 | 0.1545 |
| intercept | \alpha | 3.9406 | 0.5675 |
| stock.price | \beta | -0.0487 | 0.0247 |
We need to check if the 3 parameters \phi,\alpha,\beta are significant
In order to do that, we need to compute the three t-statistics:
t = \frac{\text{parameter}}{\mathop{\mathrm{e.s.e.}}} \sim t_{\rm df}
p = 2 P(t_{\rm df} > |t|)
Output:
Coefficients:
ar1 intercept stock.price
0.5578 3.9406 -0.0487
s.e. 0.1545 0.5675 0.0247
t-statistics:  3.610356     6.943789      -1.97166
p-values:      0.001100235  1.032974e-07  0.05793185
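The t-statistics and p-values can be reproduced from the printed estimates and standard errors; the degrees of freedom below assume n = 33 observations and 3 estimated parameters, i.e. df = 30:

```r
# Estimates and e.s.e. from the printed arima() output
est <- c(ar1 = 0.5578, intercept = 3.9406, stock.price = -0.0487)
ese <- c(ar1 = 0.1545, intercept = 0.5675, stock.price = 0.0247)

# t-statistics
t.stat <- est / ese

# Two-sided p-values, with df = n - (number of parameters) = 33 - 3 = 30
p.val <- 2 * pt(abs(t.stat), df = 30, lower.tail = FALSE)

print(t.stat)
print(p.val)
```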
Conclusion: We fitted the AR(1) model:
\begin{align*} (\texttt{gold.price})_t & = \alpha + \beta \times (\texttt{stock.price})_t + \varepsilon_t \\[1.0em] \varepsilon_t & = \phi \varepsilon_{t-1} + u_t \,, \qquad u_t \,\, \text{ iid } \,\, N(0,\sigma^2) \end{align*}
| Parameter | \phi | \alpha | \beta |
|---|---|---|---|
| Estimate | 0.5578 | 3.9406 | -0.0487 |
| p-value | 0.001100235 | 1.032974e-07 | 0.05793185 |
The p-value for \phi is significant: \,\, p=0.001100235 < 0.05
This shows the residuals are actually autocorrelated, with relationship
\varepsilon_t = 0.5578 \times \varepsilon_{t-1} + u_t
The p-value for \beta is not significant: \,\, p=0.05793185 > 0.05
However p < 0.1, giving weak evidence of a relationship between stock and gold price
This means we can make predictions using the linear model
\texttt{gold.price} = 3.9406 - 0.0487 \times (\texttt{stock.price})
Conclusion: Fitting the AR(1) model gives the relationship
\begin{equation} \tag{1} \texttt{gold.price} = 3.9406 - 0.0487 \times (\texttt{stock.price}) \end{equation}
Coefficients:
(Intercept) stock.price
4.21285 -0.06409
\texttt{gold.price} = 4.21285 - 0.06409 \times (\texttt{stock.price})
Therefore the AR(1) estimate in (1) is preferred
y_t = \alpha + \beta x_t + \varepsilon_t
The AR(1) model assumes the residuals form a stationary time series
A time series is stationary if its behavior is consistent over time
Specifically, two properties should remain roughly constant over time: the mean and the variance
For example, let us examine the residuals \varepsilon_t from the ordinary linear model:
\texttt{gold.price} = \alpha + \beta \times \texttt{stock.price} + \varepsilon_t
The residuals form a time series that looks stationary
That makes sense – We were able to fit the AR(1) model without issues
Now, let us examine the residuals \varepsilon_t from the OLS regression with the variables swapped:
\texttt{stock.price} = \alpha + \beta \times \texttt{gold.price} + \varepsilon_t
The residuals show an upward trend: the series is not stationary
Consequently, fitting an AR(1) model will likely encounter problems
\texttt{stock.price} = \alpha + \beta \times \texttt{gold.price} + \varepsilon_t
Error in arima(stock.price, xreg = gold.price, order = c(1, 0, 0)) :
non-stationary AR part from CSS
As expected, we get an error: the residuals \varepsilon_t are not stationary, so the AR(1) fit fails
Differencing: common technique to make a time series stationary by eliminating trends
diff(x) computes the difference between consecutive observations:
\text{diff}(x)_t = x_t - x_{t-1}
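A quick illustration of diff() on a short made-up vector:

```r
x <- c(2, 5, 4, 9)

# Consecutive differences: x_t - x_{t-1}
diff(x)
# [1]  3 -1  5
```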
# Fit AR(1) on differenced series
arima(diff(stock.price), xreg = diff(gold.price), order = c(1, 0, 0))

Coefficients:
ar1 intercept diff(gold.price)
0.5141 1.0442 0.1632
s.e. 0.1510 0.7342 0.3879
Warning: AR(1) is fitted on differences \implies coefficients not valid for original series
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
\det(Z^T Z ) \, \approx \, 0 \, \quad \implies \quad Z^T Z \, \text{ is (almost) not invertible}
\text{Multicollinearity = multiple (linear) relationships between the Z-variables}
Z-variables inter-related \quad \implies \quad hard to isolate individual influence on Y
| Z_1 | Z_2 | Z_3 |
|---|---|---|
| 10 | 50 | 52 |
| 15 | 75 | 75 |
| 18 | 90 | 97 |
| 24 | 120 | 129 |
| 30 | 150 | 152 |
Approximate collinearity for Z_3 and Z_1: \quad Z_3 \approx 5 Z_1
All instances qualify as multicollinearity
Z = (Z_1 | Z_2 | \ldots | Z_p) = \left( \begin{array}{cccc} z_{11} & z_{12} & \ldots & z_{1p} \\ z_{21} & z_{22} & \ldots & z_{2p} \\ \ldots & \ldots & \ldots & \ldots \\ z_{n1} & z_{n2} & \ldots & z_{np} \\ \end{array} \right)
{\rm rank} (Z) < p
{\rm rank} \left( Z^T Z \right) = {\rm rank} \left( Z \right)
{\rm rank} \left( Z^T Z \right) < p \qquad \implies \qquad Z^T Z \,\, \text{ is NOT invertible}
\hat{\beta} = (Z^T Z)^{-1} Z^T y
Multicollinearity is a big problem!
Z_1, Z_2, Z_3 as before
Exact Multicollinearity since Z_2 = 5 Z_1
Thus Z^T Z is not invertible
Let us check with R
| Z_1 | Z_2 | Z_3 |
|---|---|---|
| 10 | 50 | 52 |
| 15 | 75 | 75 |
| 18 | 90 | 97 |
| 24 | 120 | 129 |
| 30 | 150 | 152 |
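The check can be carried out as follows (a sketch of the computation; the exact near-zero value of the determinant varies with the machine):

```r
# Data from the table: Z2 is exactly 5 * Z1
Z1 <- c(10, 15, 18, 24, 30)
Z2 <- c(50, 75, 90, 120, 150)
Z3 <- c(52, 75, 97, 129, 152)

# Assemble the design matrix and compute det(Z^T Z)
Z <- matrix(c(Z1, Z2, Z3), ncol = 3)
det(t(Z) %*% Z)          # essentially zero

# Attempting the inverse throws an error: computationally singular
# solve(t(Z) %*% Z)      # uncomment to see the error
```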
R computed that \,\, {\rm det} ( Z^T Z) = -3.531172 \times 10^{-7}
Therefore the determinant of Z^T Z is almost 0
Z^T Z \text{ is not invertible!}
Error in solve.default(t(Z) %*% Z) :
system is computationally singular: reciprocal condition number = 8.25801e-19
In practice, one almost never has exact Multicollinearity
If Multicollinearity is present, it is likely to be Approximate Multicollinearity
In case of approximate Multicollinearity, it holds that
Approximate Multicollinearity is still a big problem!
\text{Small perturbations in } Z \quad \implies \quad \text{large variations in } (Z^T Z)^{-1}
\text{Small perturbations in } Z \quad \implies \quad \text{large variations in } \xi_{ij}
\mathop{\mathrm{e.s.e.}}(\beta_j) = \xi_{jj}^{1/2} \, S \,, \qquad \quad S^2 = \frac{\mathop{\mathrm{RSS}}}{n-p}
\begin{align*} \text{Multicollinearity} & \quad \implies \quad \text{Numerical instability} \\[5pts] & \quad \implies \quad \text{potentially larger } \xi_{jj} \\[5pts] & \quad \implies \quad \text{potentially larger } \mathop{\mathrm{e.s.e.}}(\beta_j) \end{align*}
t = \frac{\beta_j}{ \mathop{\mathrm{e.s.e.}}(\beta_j) }
But Multicollinearity increases the \mathop{\mathrm{e.s.e.}}(\beta_j)
Therefore, the t-statistic reduces in size:
Multicollinearity \implies It becomes harder to reject incorrect hypotheses!
Z_1, Z_2, Z_3 as before
We know we have exact Multicollinearity, since Z_2 = 5 Z_1
Therefore Z^T Z is not invertible
| Z_1 | Z_2 | Z_3 |
|---|---|---|
| 10 | 50 | 52 |
| 15 | 75 | 75 |
| 18 | 90 | 97 |
| 24 | 120 | 129 |
| 30 | 150 | 152 |
To get rid of Multicollinearity we can add a small perturbation to Z_1: \quad Z_1 \,\, \leadsto \,\, Z_1 + 0.01
The new dataset Z_1 + 0.01, Z_2, Z_3 is
| Z_1 + 0.01 | Z_2 | Z_3 |
|---|---|---|
| 10.01 | 50 | 52 |
| 15.01 | 75 | 75 |
| 18.01 | 90 | 97 |
| 24.01 | 120 | 129 |
| 30.01 | 150 | 152 |
Define the new design matrix Z = (Z_1 + 0.01 | Z_2 | Z_3)
The data is now only approximately Multicollinear
Therefore the inverse of Z^T Z exists (although it is numerically unstable)
Let us compute this inverse in R
| Z_1 + 0.01 | Z_2 | Z_3 |
|---|---|---|
| 10.01 | 50 | 52 |
| 15.01 | 75 | 75 |
| 18.01 | 90 | 97 |
| 24.01 | 120 | 129 |
| 30.01 | 150 | 152 |
Z = (Z_1 + 0.01 | Z_2 | Z_3)
# Consider perturbation Z1 + 0.01
PZ1 <- Z1 + 0.01
# Assemble perturbed design matrix
Z <- matrix(c(PZ1, Z2, Z3), ncol = 3)
# Compute the inverse of Z^T Z
solve( t(Z) %*% Z )

              [,1]          [,2]       [,3]
[1,] 17786.804216 -3556.4700048 -2.4186075
[2,] -3556.470005 711.1358432 0.4644159
[3,] -2.418608 0.4644159 0.0187805
\text{ consider } \, Z_1 + 0.02 \, \text{ instead of } \, Z_1 + 0.01
# Consider perturbation Z1 + 0.02
PZ1 <- Z1 + 0.02
# Assemble perturbed design matrix
Z <- matrix(c(PZ1, Z2, Z3), ncol = 3)
# Compute the inverse of Z^T Z
solve( t(Z) %*% Z )

             [,1]          [,2]       [,3]
[1,] 4446.701047 -888.8947902 -1.2093038
[2,] -888.894790 177.7098841 0.2225551
[3,] -1.209304 0.2225551 0.0187805
The average entry of Z_1 is 19.4
\begin{align*} \text{Percentage Change} & = \left( \frac{\text{New Value} - \text{Old Value}}{\text{Old Value}} \right) \times 100\% \\[15pts] & = \left( \frac{(19.4 + 0.02) - (19.4 + 0.01)}{19.4 + 0.01} \right) \times 100 \% \ \approx \ 0.05 \% \end{align*}
\text{Percentage Change in } \, \xi_{11} = \frac{4446 - 17786}{17786} \times 100 \% \ \approx \ −75 \%
\text{perturbation of } \, 0.05 \% \, \text{ in the data } \quad \implies \quad \text{change of } \, - 75 \% \, \text{ in } \, \xi_{11}
\text{Small perturbations in } Z \quad \implies \quad \text{large variations in } (Z^T Z)^{-1}
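Both percentage changes can be reproduced from the printed values (the \xi_{11} entries are read off the two inverses above, and 19.4 is the average entry of Z_1):

```r
# xi_11 entries read off the two printed inverses
xi11.old <- 17786.804216   # perturbation Z1 + 0.01
xi11.new <- 4446.701047    # perturbation Z1 + 0.02

# Percentage change in the data (19.4 is the average entry of Z1)
pct.data <- ((19.4 + 0.02) - (19.4 + 0.01)) / (19.4 + 0.01) * 100

# Percentage change in xi_11
pct.xi <- (xi11.new - xi11.old) / xi11.old * 100

c(data = pct.data, xi11 = pct.xi)
```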
One can often know in advance when Multicollinearity might arise
This is a big sign of potential Multicollinearity
Why is this contradictory?
Multicollinearity is essentially a data-deficiency problem
Sometimes we have no control over the dataset available
Important point:
Multicollinearity is a sample feature
Possible that another sample involving the same variables will have less Multicollinearity
Acquiring more data might reduce severity of Multicollinearity
More data can be collected by either
To do this properly would require advanced Bayesian statistical methods
This is beyond the scope of this module
Multicollinearity may be reduced by transforming variables
This may be possible in various different ways
Further reading in Chapter 10 of [2]
Simplest approach to tackle Multicollinearity is to drop one or more of the collinear variables
Goal: Find the best combination of X variables which reduces Multicollinearity
We present 2 alternatives
To detect Multicollinearity, look out for
| Expenditure Y | Income X_2 | Wealth X_3 |
|---|---|---|
| 70 | 80 | 810 |
| 65 | 100 | 1009 |
| 90 | 120 | 1273 |
| 95 | 140 | 1425 |
| 110 | 160 | 1633 |
| 115 | 180 | 1876 |
| 120 | 200 | 2052 |
| 140 | 220 | 2201 |
| 155 | 240 | 2435 |
| 150 | 260 | 2686 |
# Enter data
y <- c(70, 65, 90, 95, 110, 115, 120, 140, 155, 150)
x2 <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
x3 <- c(810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686)
# Fit model
model <- lm(y ~ x2 + x3)
# We want to display only part of summary
# First capture the output into a vector
temp <- capture.output(summary(model))
# Then print only the lines of interest
cat(paste(temp[9:20], collapse = "\n"))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.77473 6.75250 3.669 0.00798 **
x2 0.94154 0.82290 1.144 0.29016
x3 -0.04243 0.08066 -0.526 0.61509
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.808 on 7 degrees of freedom
Multiple R-squared: 0.9635, Adjusted R-squared: 0.9531
F-statistic: 92.4 on 2 and 7 DF, p-value: 9.286e-06
Three basic statistics
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.77473 6.75250 3.669 0.00798 **
x2 0.94154 0.82290 1.144 0.29016
x3 -0.04243 0.08066 -0.526 0.61509
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.808 on 7 degrees of freedom
Multiple R-squared: 0.9635, Adjusted R-squared: 0.9531
F-statistic: 92.4 on 2 and 7 DF, p-value: 9.286e-06
Main red flag for Multicollinearity:
There are many contradictions:
The high R^2 value suggests the model fits very well
However, low t-values imply neither Income nor Wealth affects Expenditure
The F-statistic is high \implies at least one of Income and Wealth affects Expenditure
The Wealth estimator has the wrong sign (\hat \beta_3 < 0). This makes no sense:
Multicollinearity is definitely present!
Method 1: Compute the correlation between Income and Wealth with cor(x2, x3):
[1] 0.9989624
This once again confirms Multicollinearity is present
Conclusion: The variables Income and Wealth are highly correlated
Method 2: Klein’s rule of thumb: Multicollinearity will be a serious problem if the R^2 obtained by regressing one predictor on the others exceeds the overall R^2 of the model
In the Expenditure vs Income and Wealth dataset we have:
Regressing Y against X_2 and X_3 gives R^2=0.9635
Regressing X_2 against X_3 gives R^2 = 0.9979
Klein’s rule of thumb suggests that Multicollinearity will be a serious problem
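Both R^2 values can be computed directly from the data given above:

```r
# Expenditure, Income, Wealth data from the table
y  <- c(70, 65, 90, 95, 110, 115, 120, 140, 155, 150)
x2 <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
x3 <- c(810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686)

# R^2 of the full regression of Y on X2 and X3
r2.full <- summary(lm(y ~ x2 + x3))$r.squared

# R^2 of the auxiliary regression of X2 on X3
r2.aux <- summary(lm(x2 ~ x3))$r.squared

# Klein's rule: serious Multicollinearity if r2.aux > r2.full
c(r2.full = r2.full, r2.aux = r2.aux, serious = r2.aux > r2.full)
```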
The variables Income and Wealth are highly correlated
Intuitively, we expect both Income and Wealth to affect Expenditure
Solution can be to drop either Income or Wealth variables
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.45455 6.41382 3.813 0.00514 **
x2 0.50909 0.03574 14.243 5.75e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.493 on 8 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9573
F-statistic: 202.9 on 1 and 8 DF, p-value: 5.753e-07
Strong evidence that Expenditure increases as Income increases
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.411045 6.874097 3.551 0.0075 **
x3 0.049764 0.003744 13.292 9.8e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.938 on 8 degrees of freedom
Multiple R-squared: 0.9567, Adjusted R-squared: 0.9513
F-statistic: 176.7 on 1 and 8 DF, p-value: 9.802e-07
Strong evidence that Expenditure increases as Wealth increases
First, we fitted the model
\text{Expenditure} = \beta_1 + \beta_2 \times \text{Income} + \beta_3 \times \text{Wealth} + \varepsilon
Method for comparing regression models
Involves iterative selection of predictor variables X to use in the model
It can be achieved through
Note: Significance criterion for X_j is in terms of AIC
Note: Stepwise Selection ensures that at each step all the variables are significant
# Fit the null model
null.model <- lm(y ~ 1)
# Fit the full model
full.model <- lm(y ~ x2 + x3 + ... + xp)

Forward selection or Stepwise selection: Start with null model
Backward selection: Start with full model
# Stepwise selection
best.model <- step(null.model,
direction = "both",
scope = formula(full.model))
# Forward selection
best.model <- step(null.model,
direction = "forward",
scope = formula(full.model))
# Backward selection
best.model <- step(full.model,
direction = "backward")

GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
Goal: Explain the number of Employed people Y in the US in terms of
Code for this example is available here longley_stepwise.R
Longley dataset available here longley.txt
Download the data file and place it in current working directory
# Read data file
longley <- read.table(file = "longley.txt", header = TRUE)
# Store columns in vectors
x2 <- longley[ , 1] # GNP Deflator
x3 <- longley[ , 2] # GNP
x4 <- longley[ , 3] # Unemployed
x5 <- longley[ , 4] # Armed Forces
x6 <- longley[ , 5] # Population
x7 <- longley[ , 6] # Year
y <- longley[ , 7] # Employed

Fit the multiple regression model, including all predictors
Y = \beta_1 + \beta_2 \, X_2 + \beta_3 \, X_3 + \beta_4 \, X_4 + \beta_5 \, X_5 + \beta_6 \, X_6 + \beta_7 \, X_7 + \varepsilon

The logical matrix below flags pairs of highly correlated predictors:

GNP.deflator GNP Unemployed Armed.Forces Population Year
GNP.deflator TRUE TRUE FALSE FALSE TRUE TRUE
GNP TRUE TRUE FALSE FALSE TRUE TRUE
Unemployed FALSE FALSE TRUE FALSE FALSE FALSE
Armed.Forces FALSE FALSE FALSE TRUE FALSE FALSE
Population TRUE TRUE FALSE FALSE TRUE TRUE
Year TRUE TRUE FALSE FALSE TRUE TRUE
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.599e+03 7.406e+02 -4.859 0.000503 ***
x3 -4.019e-02 1.647e-02 -2.440 0.032833 *
x4 -2.088e-02 2.900e-03 -7.202 1.75e-05 ***
x5 -1.015e-02 1.837e-03 -5.522 0.000180 ***
x7 1.887e+00 3.828e-01 4.931 0.000449 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2794 on 11 degrees of freedom
Multiple R-squared: 0.9954, Adjusted R-squared: 0.9937
F-statistic: 589.8 on 4 and 11 DF, p-value: 9.5e-13
R^2 = 1 - \frac{\mathop{\mathrm{RSS}}}{\mathop{\mathrm{TSS}}}
R^2 \leq 1
Drawback: R^2 increases when the number of predictors increases
We saw this phenomenon in the Galileo example in Lecture 9
Fitting the simple model \rm{distance} = \beta_1 + \beta_2 \, \rm{height} + \varepsilon gave R^2 = 0.9264
In contrast the quadratic, cubic, and quartic models gave, respectively R^2 = 0.9903 \,, \qquad R^2 = 0.9994 \,, \qquad R^2 = 0.9998
Fitting a higher degree polynomial gives higher R^2
Conclusion: If the degree of the polynomial is sufficiently high, we can get R^2 = 1
(\rm{height}_1, \rm{distance}_1) \,, \ldots , (\rm{height}_n, \rm{distance}_n)
\hat y_i = y_i \,, \qquad \forall \,\, i = 1 , \ldots, n
\mathop{\mathrm{RSS}}= \sum_{i=1}^n (y_i - \hat y_i )^2 = 0 \qquad \implies \qquad R^2 = 1
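A small demonstration: with n = 4 made-up points, a cubic (degree n - 1) polynomial interpolates the data exactly, so RSS = 0 and R^2 = 1 up to rounding:

```r
# Four made-up data points and a cubic fit (degree n - 1 = 3)
x <- c(1, 2, 3, 4)
y <- c(2.3, 5.1, 4.2, 7.9)
fit <- lm(y ~ x + I(x^2) + I(x^3))

# The polynomial interpolates the data: RSS = 0 and R^2 = 1
sum(resid(fit)^2)
summary(fit)$r.squared
```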
Warning: Adding an increasingly higher number of parameters is not always good
Example 1: In the Galileo example, we saw that
Example 2: In the Divorces example, however, things were different
# Divorces data
year <- c(1, 2, 3, 4, 5, 6,7, 8, 9, 10, 15, 20, 25, 30)
percent <- c(3.51, 9.5, 8.91, 9.35, 8.18, 6.43, 5.31,
5.07, 3.65, 3.8, 2.83, 1.51, 1.27, 0.49)
# Fit linear model
linear <- lm(percent ~ year)
# Fit order 6 model
order_6 <- lm(percent ~ year + I( year^2 ) + I( year^3 ) +
I( year^4 ) + I( year^5 ) +
I( year^6 ))
# Scatter plot of data
plot(year, percent, xlab = "", ylab = "", pch = 16, cex = 2)
# Add labels
mtext("Years of marriage", side = 1, line = 3, cex = 2.1)
mtext("Risk of divorce", side = 2, line = 2.5, cex = 2.1)
# Plot Linear Vs Quadratic
polynomial <- Vectorize(function(x, ps) {
n <- length(ps)
sum(ps*x^(1:n-1))
}, "x")
curve(polynomial(x, coef(linear)), add=TRUE, col = "red", lwd = 2)
curve(polynomial(x, coef(order_6)), add=TRUE, col = "blue", lty = 2, lwd = 3)
legend("topright", legend = c("Linear", "Order 6"),
col = c("red", "blue"), lty = c(1,2), cex = 3, lwd = 3)
The best model seems to be the linear one: \,\, y = \beta_1 + \beta_2 x + \varepsilon
Linear model interpretation:
The AIC is a number which measures how well a regression model fits the data
Also R^2 measures how well a regression model fits the data
The difference between AIC and R^2 is that AIC also accounts for overfitting
The AIC is
{\rm AIC} := 2p - 2 \log ( \hat{L} )
p = number of parameters in the model
\hat{L} = maximum value of the likelihood function
\log(\hat L)= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat\sigma^2) - \frac{1}{2\hat\sigma^2} \mathop{\mathrm{RSS}}\,, \qquad \hat \sigma^2 := \frac{\mathop{\mathrm{RSS}}}{n}
\log(\hat L)= - \frac{n}{2} \log \left( \frac{\mathop{\mathrm{RSS}}}{n} \right) + C
C is constant depending only on the number of sample points
Thus, C does not change if the data does not change
{\rm AIC} = 2p + n \log \left( \frac{ \mathop{\mathrm{RSS}}}{n} \right) - 2C
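The maximised log-likelihood and the AIC can be verified against R's built-ins on a small made-up regression; note that stats::AIC() counts \sigma^2 as one additional estimated parameter:

```r
# Check the AIC formula on a small made-up regression
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.1, 2.3, 2.8, 4.2, 4.9, 6.3)
fit <- lm(y ~ x)

n   <- length(y)
rss <- sum(resid(fit)^2)

# Maximised log-likelihood: -n/2 * (log(2*pi) + log(RSS/n) + 1)
ll <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

# R's AIC() counts sigma^2 as one extra parameter: p = 2 + 1 here
aic.manual <- 2 * (length(coef(fit)) + 1) - 2 * ll
c(manual = aic.manual, builtin = AIC(fit))
```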
# Divorces data
year <- c(1, 2, 3, 4, 5, 6,7, 8, 9, 10, 15, 20, 25, 30)
percent <- c(3.51, 9.5, 8.91, 9.35, 8.18, 6.43, 5.31,
5.07, 3.65, 3.8, 2.83, 1.51, 1.27, 0.49)
# Scatter plot of data
plot(year, percent, xlab = "", ylab = "", pch = 16, cex = 2)
# Add labels
mtext("Years of marriage", side = 1, line = 3, cex = 2.1)
mtext("Risk of divorce", side = 2, line = 2.5, cex = 2.1)
| Years of Marriage | % divorces |
|---|---|
| 1 | 3.51 |
| 2 | 9.50 |
| 3 | 8.91 |
| 4 | 9.35 |
| 5 | 8.18 |
| 6 | 6.43 |
| 7 | 5.31 |
| 8 | 5.07 |
| 9 | 3.65 |
| 10 | 3.80 |
| 15 | 2.83 |
| 20 | 1.51 |
| 25 | 1.27 |
| 30 | 0.49 |
Code for this example available here divorces_stepwise.R
First we import the data into R
# Divorces data
year <- c(1, 2, 3, 4, 5, 6,7, 8, 9, 10, 15, 20, 25, 30)
percent <- c(3.51, 9.5, 8.91, 9.35, 8.18, 6.43, 5.31,
5.07, 3.65, 3.8, 2.83, 1.51, 1.27, 0.49)

The null model is \,\, {\rm percent} = \beta_1 + \varepsilon \,\, and the full model is
\rm{percent} = \beta_1 + \beta_2 \, {\rm year} + \beta_3 \, {\rm year}^2 + \ldots + \beta_7 \, {\rm year}^6
We run stepwise regression and save the best model:
Call:
lm(formula = percent ~ year)
.....
To see why step made this choice, let us examine its output

Start: AIC=32.46
percent ~ 1
Df Sum of Sq RSS AIC
+ year 1 80.966 42.375 19.505
+ I(year^2) 1 68.741 54.600 23.054
+ I(year^3) 1 56.724 66.617 25.839
+ I(year^4) 1 48.335 75.006 27.499
+ I(year^5) 1 42.411 80.930 28.563
+ I(year^6) 1 38.068 85.273 29.295
<none> 123.341 32.463
Step: AIC=19.5
percent ~ year
Df Sum of Sq RSS AIC
<none> 42.375 19.505
+ I(year^4) 1 3.659 38.716 20.241
+ I(year^3) 1 3.639 38.736 20.248
+ I(year^5) 1 3.434 38.941 20.322
+ I(year^6) 1 3.175 39.200 20.415
+ I(year^2) 1 2.826 39.549 20.539
- year 1 80.966 123.341 32.463
step begins with the null model: percent ~ 1
step then tests adding each variable from the full model one at a time (the rows + year, + I(year^2), ..., + I(year^6))
The lowest AIC (19.505) is achieved by adding year. Therefore, the variable year is selected
The current model becomes percent ~ year
step resumes from the best model percent ~ year
The lowest AIC (19.505) is now achieved by <none>, i.e. keeping the current model
Adding any term (+ I(year^k)) or removing year (- year) would increase the AIC
Running step again would yield the same result. Thus, the final model is: percent ~ year