In Lecture 9 we introduced the general linear regression model
$$Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \dots + \beta_p z_{ip} + \varepsilon_i$$
There are $p$ predictor random variables $Z_1, \dots, Z_p$
$Y_i$ is distributed as the conditional random variable $Y \mid Z_1 = z_{i1}, \dots, Z_p = z_{ip}$
The errors $\varepsilon_i$ are random variables
Regression assumptions on $Y_i$
Predictor is known: The values $z_{i1}, \dots, z_{ip}$ are known
Normality: The distribution of $Y_i$ is normal
Linear mean: There are parameters $\beta_1, \dots, \beta_p$ such that $\mathbb{E}[Y_i] = \beta_1 z_{i1} + \dots + \beta_p z_{ip}$
Homoscedasticity: There is a parameter $\sigma^2$ such that $\mathrm{Var}[Y_i] = \sigma^2$
Independence: The random variables $Y_1, \dots, Y_n$ are independent and thus uncorrelated
$\mathrm{Cor}(Y_i, Y_j) = 0$ for all $i \neq j$
Equivalent assumptions on $\varepsilon_i$
Predictor is known: The values $z_{i1}, \dots, z_{ip}$ are known
Normality: The distribution of $\varepsilon_i$ is normal
Linear mean: The errors have zero mean $\mathbb{E}[\varepsilon_i] = 0$
Homoscedasticity: There is a parameter $\sigma^2$ such that $\mathrm{Var}[\varepsilon_i] = \sigma^2$
Independence: The errors $\varepsilon_1, \dots, \varepsilon_n$ are independent and thus uncorrelated
$\mathrm{Cor}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$
Extra assumption on design matrix
The design matrix $Z$ is such that
$Z^T Z$ is invertible
Assumptions 1-6 allowed us to estimate the parameters
$\beta = (\beta_1, \dots, \beta_p)$
By maximizing the likelihood we obtained the estimator
$$\hat{\beta} = (Z^T Z)^{-1} Z^T y$$
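As a quick illustration (not from the lecture, using simulated data and hypothetical variable names), the closed-form estimator can be computed by hand in R and compared with the output of lm:
# Minimal sketch: compute beta.hat = (Z^T Z)^{-1} Z^T y on simulated data
set.seed(1)
n <- 50
z2 <- rnorm(n)
z3 <- rnorm(n)
Z <- cbind(1, z2, z3)                          # design matrix with intercept column
y <- 2 + 3 * z2 - z3 + rnorm(n)                # simulated response
beta.hat <- solve(t(Z) %*% Z) %*% t(Z) %*% y   # closed-form estimator
beta.hat
coef(lm(y ~ z2 + z3))                          # should agree with beta.hat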
Violation of Assumptions
We consider 3 scenarios
Heteroscedasticity: The violation of Assumption 4 of homoscedasticity
$\mathrm{Var}[\varepsilon_i] \neq \mathrm{Var}[\varepsilon_j]$ for some $i \neq j$
Autocorrelation: The violation of Assumption 5 of no-correlation
$\mathrm{Cor}(\varepsilon_i, \varepsilon_j) \neq 0$ for some $i \neq j$
Multicollinearity: The violation of Assumption 6 of invertibility of the matrix
$Z^T Z$
Part 2: Heteroscedasticity
Heteroscedasticity
The general linear regression model is
$$Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \dots + \beta_p z_{ip} + \varepsilon_i$$
Consider Assumption 4
Homoscedasticity: There is a parameter $\sigma^2$ such that $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all $i$
Heteroscedasticity: The violation of Assumption 4
$\mathrm{Var}[\varepsilon_i] \neq \mathrm{Var}[\varepsilon_j]$ for some $i \neq j$
Why is homoscedasticity important?
In Lecture 10 we presented a few methods to assess linear models
Coefficient $R^2$
t-tests for parameter significance
F-test for model selection
The above methods rely heavily on homoscedasticity
Why is homoscedasticity important?
For example, the maximum likelihood estimation relied on the calculation
$$L = \prod_{i=1}^{n} f_{Y_i}(y_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2} \right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\mathrm{RSS}}{2\sigma^2} \right)$$
The calculation is only possible thanks to homoscedasticity
$\mathrm{Var}[Y_i] = \sigma^2$ for all $i$
Why is homoscedasticity important?
Suppose the calculation in the previous slide holds
$$L = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\mathrm{RSS}}{2\sigma^2} \right)$$
Then maximizing the likelihood is equivalent to solving
$$\min_{\beta} \, \mathrm{RSS}$$
The above has the closed-form solution
$$\hat{\beta} = (Z^T Z)^{-1} Z^T y$$
Why is homoscedasticity important?
Without homoscedasticity we would have
$$L \neq \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\mathrm{RSS}}{2\sigma^2} \right)$$
Therefore $\hat{\beta}$ would no longer maximize the likelihood!
In this case $\hat{\beta}$ would still be an unbiased estimator for $\beta$
$$\mathbb{E}[\hat{\beta}] = \beta$$
Why is homoscedasticity important?
However, the quantity $S^2 = \frac{\mathrm{RSS}(\hat{\beta})}{n - p}$ is no longer an unbiased estimator for the population variance $\sigma^2$, that is $\mathbb{E}[S^2] \neq \sigma^2$
This is a problem because the estimated standard error for $\hat{\beta}_j$ involves $S$, namely $\mathrm{e.s.e.}(\hat{\beta}_j) = \xi_{jj}^{1/2} S$, where $\xi_{jj}$ is the $j$-th diagonal entry of $(Z^T Z)^{-1}$
Therefore the e.s.e. becomes unreliable
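To make the formula concrete, here is a minimal sketch (simulated data, hypothetical names, not from the lecture) that computes the e.s.e. by hand and compares it with summary(lm()):
# Compute e.s.e.(beta_j) = S * sqrt(diag((Z^T Z)^{-1})) by hand
set.seed(2)
n <- 40
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
Z <- model.matrix(fit)                          # design matrix (intercept and x)
S2 <- sum(resid(fit)^2) / (n - ncol(Z))         # S^2 = RSS / (n - p)
ese <- sqrt(S2 * diag(solve(t(Z) %*% Z)))       # estimated standard errors
ese
summary(fit)$coefficients[, "Std. Error"]       # should match ese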
Why is homoscedasticity important?
Then also the t-statistic for significance of $\beta_j$ becomes unreliable
This is because the t-statistic depends on the e.s.e.
$$t = \frac{\hat{\beta}_j - \beta_j}{\mathrm{e.s.e.}(\hat{\beta}_j)}$$
Without homoscedasticity the regression maths does not work!
t-tests for significance of βj
confidence intervals for βj
F-tests for model selection
They all break down and become unreliable!
Is heteroscedasticity a serious problem?
Heteroscedasticity in linear regression is no longer a big problem
This is thanks to 1980s research on robust standard errors (a brief pointer is sketched below)
Moreover heteroscedasticity only becomes a problem when it is severe
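As a pointer (not covered in detail in this lecture): heteroscedasticity-consistent standard errors are available in R through the sandwich and lmtest packages. A minimal sketch, assuming a fitted lm object called model:
# Robust (heteroscedasticity-consistent) standard errors
library(sandwich)   # vcovHC(): robust covariance matrix estimators
library(lmtest)     # coeftest(): coefficient tests with a supplied covariance
summary(model)$coefficients                           # usual standard errors
coeftest(model, vcov. = vcovHC(model, type = "HC3"))  # robust standard errors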
How to detect Heteroscedasticity
Heteroscedasticity is commonly present in real-world datasets
# Compute residuals and fitted values
residuals <- model$resid
fitted <- model$fitted
# Plot the residual graph
plot(fitted, residuals,
     xlab = "Fitted Values", ylab = "Residuals",
     pch = 16)
# Add y = 0 line for reference
abline(0, 0, col = "red", lwd = 3)
Remedial transformation: To try and reduce heteroscedasticity take
$\log y$
This means we need to fit the model
$$\log Y_i = \alpha + \beta X_i + \varepsilon_i$$
# Transform data via log(y)
log.gold.price <- log(gold.price)
# Fit linear model with log(y) data
log.model <- lm(log.gold.price ~ stock.price)
# Compute residuals for log model
log.residuals <- log.model$resid
# Compute fitted values for log model
log.fitted <- log.model$fitted
Heteroscedasticity has definitely reduced
Left: Residual plot for original model
Right: Residual plot for $\log y$ data model
Heteroscedasticity has definitely reduced
Left: Histogram of residuals for original model
Right: Histogram of residuals for $\log y$ data model
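The residual comparison can be reproduced with a short sketch (assuming the objects model, log.model, fitted, residuals, log.fitted and log.residuals computed in the chunks above):
# Side-by-side residual plots for the original and log(y) models
par(mfrow = c(1, 2))                      # two plots in one row
plot(fitted, residuals,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Original model", pch = 16)
abline(0, 0, col = "red", lwd = 3)
plot(log.fitted, log.residuals,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "log(y) model", pch = 16)
abline(0, 0, col = "red", lwd = 3)
par(mfrow = c(1, 1))                      # reset plotting layout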
Part 3: Autocorrelation
Autocorrelation
The general linear regression model is
$$Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \dots + \beta_p z_{ip} + \varepsilon_i$$
Consider Assumption 5
Independence: Errors $\varepsilon_1, \dots, \varepsilon_n$ are independent and thus uncorrelated: $\mathrm{Cor}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$
Autocorrelation: The violation of Assumption 5
$\mathrm{Cor}(\varepsilon_i, \varepsilon_j) \neq 0$ for some $i \neq j$
Why is independence important?
Recall the methods to assess linear models
Coefficient $R^2$
t-tests for parameter significance
F-test for model selection
The above methods rely heavily on independence
Why is independence important?
Once again let us consider the likelihood calculation
$$L = f(y_1, \dots, y_n) = \prod_{i=1}^{n} f_{Y_i}(y_i) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2} \right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\mathrm{RSS}}{2\sigma^2} \right)$$
The second equality is only possible thanks to independence of
$Y_1, \dots, Y_n$
Why is independence important?
If we have autocorrelation then
$\mathrm{Cor}(\varepsilon_i, \varepsilon_j) \neq 0$ for some $i \neq j$
In particular we would have
$\varepsilon_i$ and $\varepsilon_j$ dependent $\implies$ $Y_i$ and $Y_j$ dependent
Therefore the calculation in the previous slide breaks down
$$L \neq \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\mathrm{RSS}}{2\sigma^2} \right)$$
Why is independence important?
In this case $\hat{\beta}$ no longer maximizes the likelihood!
As already seen, this implies that
$\mathrm{e.s.e.}(\hat{\beta}_j)$ is unreliable
Without independence the regression maths does not work!
t-tests for significance of βj
confidence intervals for βj
F-tests for model selection
They all break down and become unreliable!
Causes of Autocorrelation
Time-series data
Autocorrelation means that
$\mathrm{Cor}(\varepsilon_i, \varepsilon_j) \neq 0$ for some $i \neq j$
Autocorrelation is often unavoidable
Typically associated with time series data
Observations ordered wrt time or space are usually correlated
This is because observations taken close together may take similar values
Causes of Autocorrelation
Financial data
Autocorrelation is especially likely for datasets in
Accounting
Finance
Economics
Autocorrelation is likely if the data have been recorded over time
E.g. daily, weekly, monthly, quarterly, yearly
Example: Dataset on Stock prices and Gold prices
General linear regression model assumes uncorrelated errors
Not realistic to assume that price observations for say 2020 and 2021 would be independent
Causes of Autocorrelation
Inertia
Economic time series tend to exhibit cyclical behaviour
Examples include GNP, price indices, production figures, employment statistics etc.
Since these series tend to be quite slow moving
Effect of inertia is that successive observations are highly correlated
This is an extremely common phenomenon in financial and economic time series
Causes of Autocorrelation
Cobweb Phenomenon
Characteristic of industries in which a large amount of time passes between
the decision to produce something
and its arrival on the market
Cobweb phenomenon is common with agricultural commodities
Economic agents (e.g. farmers) decide
how many goods to supply to the market
based on previous year price
Causes of Autocorrelation
Cobweb Phenomenon
Example: the amount of crops farmers supply to the market at time $t$ might be
$$\mathrm{Supply}_t = \beta_1 + \beta_2 \, \mathrm{Price}_{t-1} + \varepsilon_t \qquad (3)$$
Errors $\varepsilon_t$ in equation (3) are unlikely to be completely random and patternless
This is because they represent actions of intelligent economic agents (e.g. farmers)
Error terms are likely to be autocorrelated
Causes of Autocorrelation
Data manipulation
Examples:
Quarterly data may smooth out the wild fluctuations in monthly sales figures
Low frequency economic survey data may be interpolated
However: Such data transformations may be inevitable
In social sciences data quality may be variable
This may induce systematic patterns and autocorrelation
No magic solution – Autocorrelation is unavoidable and must be considered
Detection of autocorrelation
Statistical tests
Runs test
Durbin-Watson test
Graphical methods
Simpler but can be more robust and more informative
Graphical and statistical methods can be useful cross-check of each other!
Graphical tests for autocorrelation
Time-series plot of residuals
Plot residuals $e_t$ over time
Check to see if any evidence of a systematic pattern exists
Autocorrelation plot of residuals
Natural to think that $\varepsilon_t$ and $\varepsilon_{t-1}$ may be correlated
Plot residual $e_t$ against $e_{t-1}$
Important:
No Autocorrelation: Plots will look random
Yes Autocorrelation: Plots will show certain patterns
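A minimal sketch of both plots in R (assuming a fitted lm object called model, with residuals stored in time order; the Durbin-Watson test is shown as a pointer):
# Graphical checks for autocorrelation of residuals
e <- resid(model)                 # residuals e_t in time order
n <- length(e)
par(mfrow = c(1, 2))
# Time-series plot of residuals: look for systematic patterns over time
plot(1:n, e, type = "b", xlab = "Time", ylab = "Residuals")
abline(0, 0, col = "red", lwd = 2)
# Autocorrelation plot: e_t against e_{t-1}; a visible trend suggests autocorrelation
plot(e[-n], e[-1], xlab = "e(t-1)", ylab = "e(t)", pch = 16)
par(mfrow = c(1, 1))
# Durbin-Watson test (lmtest package)
library(lmtest)
dwtest(model)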
# Enter data
y <- c(70, 65, 90, 95, 110, 115, 120, 140, 155, 150)
x2 <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
x3 <- c(810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686)
# Fit model
model <- lm(y ~ x2 + x3)
# We want to display only part of summary
# First capture the output into a vector
temp <- capture.output(summary(model))
# Then print only the lines of interest
cat(paste(temp[9:20], collapse = "\n"))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.77473 6.75250 3.669 0.00798 **
x2 0.94154 0.82290 1.144 0.29016
x3 -0.04243 0.08066 -0.526 0.61509
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.808 on 7 degrees of freedom
Multiple R-squared: 0.9635, Adjusted R-squared: 0.9531
F-statistic: 92.4 on 2 and 7 DF, p-value: 9.286e-06
Three basic statistics
$R^2$ coefficient
t-statistics and related p-values
F-statistic and related p-value
Interpreting the R output
$R^2 = 0.9635$
Model explains a substantial amount of the variation (96.35%) in the data
F-statistic is
$F = 92.40196$
Corresponding p-value is $p = 9.286 \times 10^{-6}$
There is evidence ($p < 0.05$) that at least one of income and wealth affects expenditure
Interpreting the R output
t-statistics
The t-statistic for income is $t = 1.144$
Corresponding p-value is $p = 0.29016$
The t-statistic for wealth is $t = -0.526$
Corresponding p-value is $p = 0.61509$
Both p-values are $p > 0.05$
Therefore neither income nor wealth is individually statistically significant
This would suggest that the regression parameters satisfy $\beta_2 = \beta_3 = 0$
Main red flag for Multicollinearity:
High $R^2$ value coupled with low t-values (corresponding to high p-values)
The output looks strange
There are many contradictions
High $R^2$ value suggests the model is really good
However low t-values imply neither income nor wealth affects expenditure
F-statistic is high, meaning that at least one of income and wealth affects expenditure
The wealth variable has the wrong sign ($\hat{\beta}_3 < 0$)
This makes no sense: it is likely that expenditure will increase as wealth increases
Therefore we would expect $\hat{\beta}_3 > 0$
Multicollinearity is definitely present!
Detection of Multicollinearity
Computing the correlation
For further confirmation of Multicollinearity compute the correlation of $X_2$ and $X_3$
cor(x2, x3)
[1] 0.9989624
Correlation is almost 1: Variables $X_2$ and $X_3$ are very highly correlated
This once again confirms Multicollinearity is present
Conclusion: The variables Income and Wealth are highly correlated
Impossible to isolate individual impact of either Income or Wealth upon Expenditure
Detection of Multicollinearity
Klein’s rule of thumb
Klein’s rule of thumb: Multicollinearity will be a serious problem if:
The $R^2$ obtained from regressing one predictor variable against the other predictor variables is greater than the overall $R^2$ obtained by regressing $Y$ against all the $X$ variables
Example: In the Expenditure vs Income and Wealth dataset we have
Regressing $Y$ against $X_2$ and $X_3$ gives $R^2 = 0.9635$
Regressing $X_2$ against $X_3$ gives $R^2 = 0.9979$
Klein’s rule of thumb suggests that Multicollinearity will be a serious problem
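This check can be carried out directly in R with the data entered earlier (y, x2, x3 from the chunk above):
# Klein's rule of thumb: compare the auxiliary R^2 with the overall R^2
overall.r2 <- summary(lm(y ~ x2 + x3))$r.squared   # about 0.9635
aux.r2 <- summary(lm(x2 ~ x3))$r.squared           # about 0.9979
c(overall = overall.r2, auxiliary = aux.r2)
aux.r2 > overall.r2   # TRUE: multicollinearity is a serious problem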
Remedial measures
Do nothing
Multicollinearity is essentially a data-deficiency problem
Sometimes we have no control over the dataset available
Important point:
Doing nothing should only be an option in quantitative social sciences (e.g. finance, economics) where data is often difficult to collect
For scientific experiments (e.g. physics, chemistry) one should strive to collect good data
Remedial measures
Acquire new/more data
Multicollinearity is a sample feature
Possible that another sample involving the same variables will have less Multicollinearity
Acquiring more data might reduce severity of Multicollinearity
More data can be collected by either
increasing the sample size or
including additional variables
Remedial measures
Use prior information about some parameters
To do this properly would require advanced Bayesian statistical methods
This is beyond the scope of this module
Remedial measures
Rethinking the model
Sometimes a model chosen for empirical analysis is not carefully thought out
Some important variables may be omitted
The functional form of the model may have been incorrectly chosen
Sometimes using more advanced statistical techniques may be required
Factor Analysis
Principal Components Analysis
Ridge Regression
Above techniques are outside the scope of this module
Remedial measures
Transformation of variables
Multicollinearity may be reduced by transforming variables
This may be possible in various different ways
E.g. for time-series data one might consider forming a new model by taking first differences
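A minimal sketch of the idea on simulated data (hypothetical variable names, not from the lecture): two trending predictors that are almost collinear in levels are far less correlated after first differencing.
# First differences can reduce multicollinearity between slow-moving series
set.seed(3)
t <- 1:100
w2 <- 0.5 * t + rnorm(100)            # trending predictor
w3 <- 0.5 * t + rnorm(100)            # another trending predictor
v <- 1 + w2 + w3 + rnorm(100)         # simulated response
cor(w2, w3)                           # close to 1 in levels
cor(diff(w2), diff(w3))               # much smaller after differencing
summary(lm(diff(v) ~ diff(w2) + diff(w3)))   # model in first differences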
ANOVA is used to compare means of independent populations:
Suppose we have $M$ normally distributed populations $N(\mu_k, \sigma^2)$ (with the same variance)
The goal is to compare the population means $\mu_k$
For $M = 2$ this can be done with the two-sample t-test
For $M > 2$ we use ANOVA
ANOVA is a generalization of the two-sample t-test to multiple populations
The ANOVA F-test
To compare the population means we use the hypotheses
$$H_0: \mu_1 = \mu_2 = \dots = \mu_M \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i, j$$
To decide on the above hypotheses we use the ANOVA F-test
We omit mathematical details. All you need to know is that
ANOVA F-test is performed in R with the function aov
This function outputs the so-called ANOVA table
ANOVA table contains the F-statistic and corresponding p-value for the ANOVA F-test
If $p < 0.05$ we reject $H_0$ and there is a difference in the population means
ANOVA F-test for fridge sales
In the Fridge Sales example we wish to compare Fridge sales numbers
We have Fridge Sales data for each quarter
Each Quarter represents a population
We want to compare average fridge sales for each Quarter
ANOVA hypothesis test: is there a difference in average sales for each Quarter?
If $\mu_i$ is the average sales in Quarter $i$ then
$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i, j$$
Plotting the data
Plot Quarter against Fridge sales
plot(quarter, fridge)
We clearly see 4 populations
Averages appear different
ANOVA F-test for fridge sales
To implement ANOVA F-test in R you need to use command factor
# Turn vector quarter into a factor
quarter.f <- factor(quarter)
The factor quarter.f allows us to label the fridge sales data
Factor level k corresponds to fridge sales in Quarter k
To perform the ANOVA F-test do
# Fit ANOVA model
factor.aov <- aov(fridge ~ quarter.f)
# Print output
summary(factor.aov)
ANOVA output
The summary gives the following analysis of variance (ANOVA) table
Df Sum Sq Mean Sq F value Pr(>F)
quarter.f 3 915636 305212 10.6 7.91e-05 ***
Residuals 28 806142 28791
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The F-statistic for the ANOVA F-test is $F = 10.6$
The p-value for the ANOVA F-test is $p = 7.91 \times 10^{-5}$
Therefore $p < 0.05$ and we reject $H_0$
Evidence that average Fridge sales are different in at least two quarters
ANOVA F-statistic from regression
Alternative way to get ANOVA F-statistic: Look at output of dummy variable model
lm(fridge ~ q2 + q3 + q4)
Residual standard error: 169.7 on 28 degrees of freedom
Multiple R-squared: 0.5318, Adjusted R-squared: 0.4816
F-statistic: 10.6 on 3 and 28 DF, p-value: 7.908e-05
The F-test for overall fit gives
F-statistic $F = 10.6$
p-value $p = 7.908 \times 10^{-5}$
These always coincide with the F-statistic and p-value for the ANOVA F-test! Why?
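As a quick cross-check (assuming fridge, quarter.f and the dummy variables q2, q3, q4 from the earlier slides are in the workspace):
# The regression F-test and the ANOVA F-test give the same F = 10.6 on 3 and 28 DF
anova(lm(fridge ~ quarter.f))          # ANOVA table from the factor model
summary(lm(fridge ~ q2 + q3 + q4))     # F-statistic for overall significance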
Part 5: ANOVA F-test and regression
ANOVA F-test and regression
ANOVA F-test equivalent to F-test for overall significance
This happens because
Linear regression can be used to perform ANOVA F-test
We have already seen a particular instance of this fact in the Homework
Simple linear regression can be used to perform the two-sample t-test
This was done by considering the model
$$Y_i = \alpha + \beta \, 1_B(i) + \varepsilon_i$$
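A minimal sketch of this fact on simulated data (hypothetical names, not from the homework): the t-test for $\beta$ in the dummy-variable regression coincides with the two-sample t-test with equal variances.
# Two-sample t-test via simple linear regression with a dummy variable
set.seed(4)
a <- rnorm(20, mean = 5, sd = 2)        # sample from population A
b <- rnorm(25, mean = 7, sd = 2)        # sample from population B
y <- c(a, b)                            # concatenated data vector
dummyB <- c(rep(0, 20), rep(1, 25))     # 1_B(i): indicator of population B
summary(lm(y ~ dummyB))$coefficients    # t-statistic and p-value for beta
t.test(b, a, var.equal = TRUE)          # same t-statistic and p-value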
Simple regression for two-sample t-test
In more detail:
A and B are two normal populations $N(\mu_A, \sigma^2)$ and $N(\mu_B, \sigma^2)$
We have two samples
Sample of size $n_A$ from population A: $a = (a_1, \dots, a_{n_A})$
Sample of size $n_B$ from population B: $b = (b_1, \dots, b_{n_B})$
Simple regression for two-sample t-test
The data vector $y$ is obtained by concatenating $a$ and $b$
$$y = (a, b) = (a_1, \dots, a_{n_A}, b_1, \dots, b_{n_B})$$
We then considered the dummy variable model
$$Y_i = \alpha + \beta \, 1_B(i) + \varepsilon_i$$
Here $1_B(i)$ is the dummy variable for population B
$$1_B(i) = \begin{cases} 1 & \text{if the } i\text{-th sample belongs to population } B \\ 0 & \text{otherwise} \end{cases}$$
Simple regression for two-sample t-test
The regression function is therefore
$$\mathbb{E}[Y \mid 1_B = x] = \alpha + \beta x$$
By construction we have that
$$Y \mid \text{sample belongs to population } A \sim N(\mu_A, \sigma^2)$$
Therefore by definition of $1_B$ we get
$$\mathbb{E}[Y \mid 1_B = 0] = \mathbb{E}[Y \mid \text{sample belongs to population } A] = \mu_A$$
Simple regression for two-sample t-test
On the other hand, the regression function is
$$\mathbb{E}[Y \mid 1_B = x] = \alpha + \beta x$$
Thus
$$\mathbb{E}[Y \mid 1_B = 0] = \alpha + \beta \cdot 0 = \alpha$$
Recall that $\mathbb{E}[Y \mid 1_B = 0] = \mu_A$
Hence we get
$$\alpha = \mu_A$$
Simple regression for two-sample t-test
Similarly, we get that
$$\mathbb{E}[Y \mid 1_B = 1] = \mathbb{E}[Y \mid \text{sample belongs to population } B] = \mu_B$$
On the other hand, using the regression function we get
$$\mathbb{E}[Y \mid 1_B = 1] = \alpha + \beta \cdot 1 = \alpha + \beta$$
Therefore
$$\alpha + \beta = \mu_B$$
Simple regression for two-sample t-test
Since $\alpha = \mu_A$ we get
$$\alpha + \beta = \mu_B \implies \beta = \mu_B - \mu_A$$
In particular we have shown that the regression model is
$$Y_i = \alpha + \beta \, 1_B(i) + \varepsilon_i = \mu_A + (\mu_B - \mu_A) \, 1_B(i) + \varepsilon_i$$
Simple regression for two-sample t-test
The F-test for overall significance for the model in the previous slide is
$$H_0: \beta = 0 \qquad H_1: \beta \neq 0$$
Since $\beta = \mu_B - \mu_A$ the above is equivalent to
$$H_0: \mu_A = \mu_B \qquad H_1: \mu_A \neq \mu_B$$
F-test for overall significance is equivalent to the two-sample t-test
ANOVA F-test and regression
We now consider the general ANOVA case
Suppose we have $M$ populations $A_i$ with normal distribution $N(\mu_i, \sigma^2)$ (with the same variance)
Example: In the Fridge sales example we have $M = 4$ populations (the 4 quarters)
The ANOVA F-test for difference in population means is
$$H_0: \mu_1 = \mu_2 = \dots = \mu_M \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i, j$$
Goal: Show that the above test can be obtained with regression
Setting up dummy variable model
We want to introduce a dummy variable regression model which models ANOVA
To each population $A_j$ associate a dummy variable
$$1_{A_j}(i) = \begin{cases} 1 & \text{if the } i\text{-th sample belongs to population } A_j \\ 0 & \text{otherwise} \end{cases}$$
Suppose we have a sample $a_j$ of size $n_{A_j}$ from population $A_j$
Concatenate these samples into a long vector (of length $n_{A_1} + \dots + n_{A_M}$)
$$y = (a_1, \dots, a_M)$$
Setting up dummy variable model
Consider the dummy variable model (with $1_{A_1}$ omitted)
$$Y_i = \beta_1 + \beta_2 \, 1_{A_2}(i) + \beta_3 \, 1_{A_3}(i) + \dots + \beta_M \, 1_{A_M}(i) + \varepsilon_i$$
The regression function is therefore
$$\mathbb{E}[Y \mid 1_{A_2} = x_2, \dots, 1_{A_M} = x_M] = \beta_1 + \beta_2 x_2 + \dots + \beta_M x_M$$
By construction we have that
$$Y \mid \text{sample belongs to population } A_i \sim N(\mu_i, \sigma^2)$$
Conditional expectations
The sample belongs to population $A_1$ if and only if
$$1_{A_2} = 1_{A_3} = \dots = 1_{A_M} = 0$$
Hence we can compute the conditional expectation
$$\mathbb{E}[Y \mid 1_{A_2} = 0, \dots, 1_{A_M} = 0] = \mathbb{E}[Y \mid \text{sample belongs to population } A_1] = \mu_1$$
On the other hand, by definition of the regression function, we get
$$\mathbb{E}[Y \mid 1_{A_2} = 0, \dots, 1_{A_M} = 0] = \beta_1 + \beta_2 \cdot 0 + \dots + \beta_M \cdot 0 = \beta_1$$
We conclude that $\mu_1 = \beta_1$
Conditional expectations
Similarly, the sample belongs to population $A_2$ if and only if
$$1_{A_2} = 1 \quad \text{and} \quad 1_{A_3} = \dots = 1_{A_M} = 0$$
Hence we can compute the conditional expectation
$$\mathbb{E}[Y \mid 1_{A_2} = 1, 1_{A_3} = 0, \dots, 1_{A_M} = 0] = \mathbb{E}[Y \mid \text{sample belongs to population } A_2] = \mu_2$$
On the other hand, by definition of the regression function, we get
$$\mathbb{E}[Y \mid 1_{A_2} = 1, 1_{A_3} = 0, \dots, 1_{A_M} = 0] = \beta_1 + \beta_2 \cdot 1 + \dots + \beta_M \cdot 0 = \beta_1 + \beta_2$$
We conclude that $\mu_2 = \beta_1 + \beta_2$
Conditional expectations
We have shown that
$$\mu_1 = \beta_1 \quad \text{and} \quad \mu_2 = \beta_1 + \beta_2$$
Therefore we obtain
$$\beta_1 = \mu_1 \quad \text{and} \quad \beta_2 = \mu_2 - \mu_1$$
Arguing in a similar way, we can show that also
$$\beta_i = \mu_i - \mu_1 \quad \text{for all } i > 1$$
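This identity can be checked numerically with a short sketch (simulated data with $M = 3$ groups, hypothetical names):
# Dummy-variable regression coefficients recover beta_1 = mu_1 and beta_i = mu_i - mu_1
set.seed(5)
g1 <- rnorm(30, mean = 10)
g2 <- rnorm(30, mean = 12)
g3 <- rnorm(30, mean = 15)
y <- c(g1, g2, g3)
group <- factor(rep(c("A1", "A2", "A3"), each = 30))
coef(lm(y ~ group))                                     # intercept, A2 effect, A3 effect
c(mean(g1), mean(g2) - mean(g1), mean(g3) - mean(g1))   # should match the coefficients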
Conclusion
The considered dummy variable model is
$$Y_i = \beta_1 + \beta_2 \, 1_{A_2}(i) + \dots + \beta_M \, 1_{A_M}(i) + \varepsilon_i$$
The F-test of overall significance for the above regression model is
$$H_0: \beta_2 = \beta_3 = \dots = \beta_M = 0 \qquad H_1: \text{at least one of the above } \beta_j \neq 0$$
We have also shown that
$$\beta_1 = \mu_1, \qquad \beta_i = \mu_i - \mu_1 \quad \text{for all } i > 1$$
In particular we obtain
$$\beta_2 = \dots = \beta_M = 0 \iff \mu_2 - \mu_1 = \dots = \mu_M - \mu_1 = 0 \iff \mu_2 = \dots = \mu_M = \mu_1$$
Therefore the F-test for overall significance is equivalent to the ANOVA F-test
$$H_0: \mu_1 = \mu_2 = \dots = \mu_M \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i, j$$
F-test for overall significance is equivalent to ANOVA F-test