(X,Y) random vector with joint pdf or pmf fX,Y and marginal pdfs or pmfs fX,fY. We say that X and Y are independent random variables if fX,Y(x,y)=fX(x)fY(y),∀(x,y)∈R2
Independence of random variables
Conditional distributions and probabilities
If X and Y are independent then X gives no information on Y (and vice-versa):
Conditional distribution: Y∣X is the same as Y: f(y∣x) = fX,Y(x,y)/fX(x) = fX(x)fY(y)/fX(x) = fY(y)
Conditional probabilities: From the above we also obtain P(Y∈A∣x) = ∑_{y∈A} f(y∣x) = ∑_{y∈A} fY(y) = P(Y∈A) (discrete rv) and P(Y∈A∣x) = ∫_A f(y∣x) dy = ∫_A fY(y) dy = P(Y∈A) (continuous rv)
Independence of random variables
Characterization of independence - Densities
Theorem
(X,Y) random vector with joint pdf or pmf fX,Y. The following are equivalent:
X and Y are independent random variables
There exist functions g(x) and h(y) such that fX,Y(x,y)=g(x)h(y),∀(x,y)∈R2
Note:
g(x) and h(y) are not necessarily the pdfs or pmfs of X and Y
However they coincide with fX and fY, up to rescaling by a constant
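As a quick numerical illustration of the factorization criterion (a minimal sketch, assuming NumPy is available; the pmf values below are made up for illustration), one can build a joint pmf as an outer product of two marginals and check that summing out either coordinate recovers the marginals:

```python
import numpy as np

# Hypothetical marginal pmfs of two discrete random variables X and Y
f_X = np.array([0.2, 0.5, 0.3])        # pmf of X on {0, 1, 2}
f_Y = np.array([0.1, 0.4, 0.4, 0.1])   # pmf of Y on {0, 1, 2, 3}

# Under independence the joint pmf factorizes: f_XY[x, y] = f_X[x] * f_Y[y]
f_XY = np.outer(f_X, f_Y)

# Summing the joint over the other coordinate recovers each marginal
print(np.allclose(f_XY.sum(axis=1), f_X))   # True
print(np.allclose(f_XY.sum(axis=0), f_Y))   # True
```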
Exercise
A student leaves for class between 8 AM and 8:30 AM and takes between 40 and 50 minutes to get there
Denote by X the time of departure
X=0 corresponds to 8 AM
X=30 corresponds to 8:30 AM
Denote by Y the travel time
Assume that X and Y are independent and uniformly distributed
Question: Find the probability that the student arrives to class before 9 AM
Solution
By assumption X is uniform on (0,30). Therefore fX(x) = 1/30 if x∈(0,30), and fX(x) = 0 otherwise
By assumption Y is uniform on (40,50). Therefore fY(y) = 1/10 if y∈(40,50), and fY(y) = 0 otherwise, where we used that 50−40=10
Solution
Define the rectangle R=(0,30)×(40,50)
Since X and Y are independent, we get
fX,Y(x,y) = fX(x)fY(y) = 1/300 if (x,y)∈R, and fX,Y(x,y) = 0 otherwise
Solution
The arrival time is given by X+Y
Therefore, the student arrives to class before 9 AM iff X+Y<60
Notice that {X+Y<60}={(x,y)∈R2:0≤x<60−y,40≤y<50}
Solution
Therefore, the probability of arriving before 9 AM is
P(arrives before 9 AM) = P(X+Y<60) = ∫_{X+Y<60} fX,Y(x,y) dx dy = ∫_40^50 ( ∫_0^{60−y} (1/300) dx ) dy = (1/300) ∫_40^50 (60−y) dy = (1/300) [60y − y²/2]_{y=40}^{y=50} = (1/300)·(1750−1600) = 1/2
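A Monte Carlo check of this computation (a sketch assuming NumPy; the seed and the number of simulated days are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

X = rng.uniform(0, 30, n)    # departure time after 8 AM, in minutes
Y = rng.uniform(40, 50, n)   # travel time, in minutes

# Fraction of simulated days on which the student arrives before 9 AM
print(np.mean(X + Y < 60))   # approximately 0.5
```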
Consequences of independence
Theorem
Suppose X and Y are independent random variables. Then
For any A,B⊂R we have P(X∈A,Y∈B)=P(X∈A)P(Y∈B)
Suppose g(x) is a function of (only) x, h(y) is a function of (only) y. Then IE[g(X)h(Y)]=IE[g(X)]IE[h(Y)]
Application: MGF of sums
Theorem
Suppose X and Y are independent random variables and denote by MX and MY their MGFs. Then MX+Y(t)=MX(t)MY(t)
Proof: Follows from the previous Theorem: MX+Y(t) = IE[e^(t(X+Y))] = IE[e^(tX) e^(tY)] = IE[e^(tX)] IE[e^(tY)] = MX(t) MY(t)
Example - Sum of independent normals
Suppose X∼N(μ1,σ1²) and Y∼N(μ2,σ2²) are independent normal random variables
We have seen in Lecture 1 that for normal distributions MX(t) = exp(μ1 t + t²σ1²/2) and MY(t) = exp(μ2 t + t²σ2²/2)
Since X and Y are independent, from the previous Theorem we get MX+Y(t) = MX(t)·MY(t) = exp(μ1 t + t²σ1²/2) · exp(μ2 t + t²σ2²/2) = exp( (μ1+μ2)t + t²(σ1²+σ2²)/2 )
Example - Sum of independent normals
Therefore Z := X+Y has moment generating function MZ(t) = MX+Y(t) = exp( (μ1+μ2)t + t²(σ1²+σ2²)/2 )
The above is the mgf of a normal distribution with mean μ1+μ2 and variance σ1²+σ2²
By the Theorem in Slide 68 of Lecture 1 we have Z∼N(μ1+μ2,σ12+σ22)
Sum of independent normals is normal
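A small simulation consistent with this conclusion (a sketch assuming NumPy; the parameters are arbitrary, and only the mean and variance of X+Y are checked, not normality itself):

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, sigma1 = 2.0, 1.0    # X ~ N(2, 1)
mu2, sigma2 = -1.0, 2.0   # Y ~ N(-1, 4)
n = 10**6

Z = rng.normal(mu1, sigma1, n) + rng.normal(mu2, sigma2, n)

# Empirical mean and variance of X + Y vs mu1+mu2 and sigma1^2+sigma2^2
print(Z.mean(), mu1 + mu2)              # ~1.0 vs 1.0
print(Z.var(), sigma1**2 + sigma2**2)   # ~5.0 vs 5.0
```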
Covariance & Correlation
Relationship between RV
Given two random variables X and Y we said that
X and Y are independent if fX,Y(x,y)=fX(x)fY(y)
In this case there is no relationship between X and Y
This is reflected in the conditional distributions: X∣Y ∼ X and Y∣X ∼ Y
Covariance & Correlation
Relationship between RV
If X and Y are not independent then there is a relationship between them
Question
How do we measure the strength of such dependence?
Answer: By introducing the notions of
Covariance
Correlation
Covariance
Definition
Notation: Given two rv X and Y we denote μX := IE[X], σX² := Var[X], μY := IE[Y], σY² := Var[Y]
Definition
The covariance of X and Y is the number Cov(X,Y):=IE[(X−μX)(Y−μY)]
Covariance
Alternative Formula
Theorem
The covariance of X and Y can be computed via Cov(X,Y)=IE[XY]−IE[X]IE[Y]
Correlation
Remark:
Cov(X,Y) encodes only qualitative information about the relationship between X and Y
To obtain quantitative information we introduce the correlation
Definition
The correlation of X and Y is the number ρXY := Cov(X,Y)/(σX σY)
Correlation detects linear relationships
Theorem
For any random variables X and Y we have
−1≤ρXY≤1
∣ρXY∣=1 if and only if there exist a,b∈R, with a≠0, such that P(Y=aX+b)=1
If X and Y are independent random variables then Cov(X,Y)=0,ρXY=0
Proof:
If X and Y are independent then IE[XY]=IE[X]IE[Y]
Therefore Cov(X,Y)=IE[XY]−IE[X]IE[Y]=0
Moreover ρXY=0 by definition
Formula for Variance
Variance is quadratic
Theorem
For any two random variables X and Y and a,b∈R: Var[aX+bY] = a²Var[X] + b²Var[Y] + 2ab·Cov(X,Y). If X and Y are independent then Var[aX+bY] = a²Var[X] + b²Var[Y]
Proof: Exercise
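A numerical sanity check of the variance formula (a sketch assuming NumPy; the distributions of X and Y and the coefficients a, b are arbitrary choices, with Y deliberately correlated with X):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**6
a, b = 2.0, -3.0

X = rng.normal(0.0, 1.0, n)
Y = 0.5 * X + rng.normal(0.0, 1.0, n)   # Y is deliberately correlated with X

# Use ddof=0 everywhere so the empirical identity holds exactly
cov_XY = np.cov(X, Y, ddof=0)[0, 1]
lhs = np.var(a * X + b * Y)
rhs = a**2 * np.var(X) + b**2 * np.var(Y) + 2 * a * b * cov_XY
print(lhs, rhs)   # the two numbers coincide up to floating-point error
```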
Example 1
Assume X and Z are independent, and X∼uniform(0,1), Z∼uniform(0,1/10)
Consider the random variable Y=X+Z
Since X and Z are independent, and Z is uniform, we have that Y∣X=x ∼ uniform(x, x+1/10) (adding x to Z simply shifts the uniform distribution of Z by x)
Question: Is the correlation ρXY between X and Y high or low?
Example 1
As Y∣X ∼ uniform(X, X+1/10), the conditional pdf of Y given X=x is f(y∣x) = 10 if y∈(x, x+1/10), and f(y∣x) = 0 otherwise
As X∼uniform(0,1), its pdf is fX(x) = 1 if x∈(0,1), and fX(x) = 0 otherwise
Therefore, the joint distribution of (X,Y) is fX,Y(x,y) = f(y∣x)fX(x) = 10 if x∈(0,1) and y∈(x, x+1/10), and fX,Y(x,y) = 0 otherwise
Example 1
In gray: the region where fX,Y(x,y)>0
When X increases, Y increases linearly (not surprising, since Y=X+Z)
We expect the correlation ρXY to be close to 1
Example 1 – Computing ρXY
For a random variable W∼uniform(a,b), we have IE[W] = (a+b)/2 and Var[W] = (b−a)²/12
Since X∼uniform(0,1) and Z∼uniform(0,1/10), we have IE[X]=1/2, Var[X]=1/12, IE[Z]=1/20, Var[Z]=1/1200
Since X and Z are independent, we also have Var[Y] = Var[X+Z] = Var[X] + Var[Z] = 1/12 + 1/1200
Example 1 – Computing ρXY
Since X and Z are independent, we have IE[XZ]=IE[X]IE[Z]
We conclude that Cov(X,Y) = IE[XY] − IE[X]IE[Y] = IE[X(X+Z)] − IE[X]IE[X+Z] = IE[X²] − IE[X]² + IE[XZ] − IE[X]IE[Z] = Var[X] = 1/12
Example 1 – Computing ρXY
The correlation between X and Y is ρXY = Cov(X,Y)/√(Var[X]·Var[Y]) = (1/12) / √( (1/12)·(1/12 + 1/1200) ) = √(100/101)
As expected, we have very high correlation ρXY≈1
This confirms a very strong linear relationship between X and Y
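The same conclusion can be checked by simulation (a sketch assuming NumPy; seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6

X = rng.uniform(0, 1, n)
Z = rng.uniform(0, 1/10, n)
Y = X + Z

# Empirical correlation vs the exact value sqrt(100/101) ~ 0.995
print(np.corrcoef(X, Y)[0, 1], np.sqrt(100 / 101))
```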
Example 2
Assume X and Z are independent, and X∼uniform(−1,1), Z∼uniform(0,1/10)
Define the random variable Y=X2+Z
Since X and Z are independent, and Z is uniform, we have that Y∣X=x ∼ uniform(x², x²+1/10) (adding x² to Z simply shifts the uniform distribution of Z by x²)
Question: Is the correlation ρXY between X and Y high or low?
Example 2
As Y∣X ∼ uniform(X², X²+1/10), the conditional pdf of Y given X=x is f(y∣x) = 10 if y∈(x², x²+1/10), and f(y∣x) = 0 otherwise
As X∼uniform(−1,1), its pdf is fX(x) = 1/2 if x∈(−1,1), and fX(x) = 0 otherwise
Therefore, the joint distribution of (X,Y) is fX,Y(x,y) = f(y∣x)fX(x) = 10·(1/2) = 5 if x∈(−1,1) and y∈(x², x²+1/10), and fX,Y(x,y) = 0 otherwise
Example 2
In gray: the region where fX,Y(x,y)>0
When X increases, Y increases quadratically (not surprising, as Y=X2+Z)
There is no linear relationship between X and Y ⟹ we expect ρXY ≈ 0
Example 2 – Computing ρXY
Since X∼uniform(−1,1), we can compute that IE[X]=IE[X3]=0
Since X and Z are independent, we have IE[XZ]=IE[X]IE[Z]=0
Example 2 – Computing ρXY
Compute the covariance Cov(X,Y)=IE[XY]−IE[X]IE[Y]=IE[XY]=IE[X(X2+Z)]=IE[X3]+IE[XZ]=0
The correlation between X and Y is ρXY = Cov(X,Y)/√(Var[X]·Var[Y]) = 0
This confirms there is no linear relationship between X and Y
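A matching simulation (a sketch assuming NumPy) illustrates that the empirical correlation is essentially zero even though Y is a function of X plus noise — zero correlation only rules out a linear relationship:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10**6

X = rng.uniform(-1, 1, n)
Z = rng.uniform(0, 1/10, n)
Y = X**2 + Z

# Empirical correlation between X and Y: close to 0
print(np.corrcoef(X, Y)[0, 1])
```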
Part 2: Multivariate random vectors
Multivariate Random Vectors
Recall
A Random vector is a function X:Ω→Rn
X is a multivariate random vector if n≥3
We denote the components of X by X=(X1,…,Xn),Xi:Ω→R
We denote the components of a point x∈Rn by x=(x1,…,xn)
Discrete and Continuous Multivariate Random Vectors
Everything we defined for bivariate vectors extends to multivariate vectors
Definition
The random vector X:Ω→Rn is:
continuous if the components Xi are continuous
discrete if the components Xi are discrete
Joint pmf
Definition
The joint pmf of a discrete random vector X is fX:Rn→R defined by fX(x)=fX(x1,…,xn):=P(X1=x1,…,Xn=xn),∀x∈Rn
Note: For all A⊂Rn it holds that P(X∈A) = ∑_{x∈A} fX(x)
Joint pdf
Definition
The joint pdf of a continuous random vector X is a function fX:Rn→R such that P(X∈A):=∫AfX(x1,…,xn)dx1…dxn=∫AfX(x)dx,∀A⊂Rn
Note: ∫A denotes an n-fold integral over all points x∈A
Expected Value
Definition
X:Ω→Rn random vector and g:Rn→R function. The expected value of the random variable g(X) is IE[g(X)] := ∑_{x∈Rn} g(x) fX(x) (X discrete) and IE[g(X)] := ∫_{Rn} g(x) fX(x) dx (X continuous)
Marginal distributions
Marginal pmf or pdf of any subset of the coordinates (X1,…,Xn) can be computed by integrating or summing the remaining coordinates
To ease notation, we only define marginals wrt the first k coordinates
Definition
The marginal pmf or marginal pdf of the random vector X with respect to the first k coordinates is the function f:Rk→R defined by f(x1,…,xk) := ∑_{(xk+1,…,xn)∈R^(n−k)} fX(x1,…,xn) (X discrete) and f(x1,…,xk) := ∫_{R^(n−k)} fX(x1,…,xn) dxk+1…dxn (X continuous)
Marginal distributions
We use a special notation for marginal pmf or pdf wrt a single coordinate
Definition
The marginal pmf or pdf of the random vector X with respect to the i-th coordinate is the function fXi:R→R defined by fXi(xi) := ∑_{x̃∈R^(n−1)} fX(x1,…,xn) (X discrete) and fXi(xi) := ∫_{R^(n−1)} fX(x1,…,xn) dx̃ (X continuous), where x̃∈R^(n−1) denotes the vector x with the i-th component removed: x̃ := (x1,…,xi−1,xi+1,…,xn)
Conditional distributions
We now define conditional distributions given the first k coordinates
Definition
Let X be a random vector and suppose that the marginal pmf or pdf wrt the first k coordinates satisfies f(x1,…,xk)>0 for all (x1,…,xk)∈Rk. The conditional pmf or pdf of (Xk+1,…,Xn) given X1=x1,…,Xk=xk is the function of (xk+1,…,xn) defined by f(xk+1,…,xn∣x1,…,xk) := fX(x1,…,xn) / f(x1,…,xk)
Conditional distributions
Similarly, we can define the conditional distribution given the i-th coordinate
Definition
Let X be a random vector and suppose that fXi(xi)>0 for a given xi∈R. The conditional pmf or pdf of X̃ given Xi=xi is the function of x̃ defined by f(x̃∣xi) := fX(x1,…,xn) / fXi(xi), where we denote X̃ := (X1,…,Xi−1,Xi+1,…,Xn) and x̃ := (x1,…,xi−1,xi+1,…,xn)
Independence
Definition
X=(X1,…,Xn) random vector with joint pmf or pdf fX and marginals fXi. We say that the random variables X1,…,Xn are mutually independent if fX(x1,…,xn) = fX1(x1)·…·fXn(xn) = ∏_{i=1}^n fXi(xi)
Proposition
If X1,…,Xn are mutually independent then for all Ai⊂R we have P(X1∈A1,…,Xn∈An) = ∏_{i=1}^n P(Xi∈Ai)
Independence
Characterization result
Theorem
X=(X1,…,Xn) random vector with joint pmf or pdf fX. The following are equivalent:
The random variables X1,…,Xn are mutually independent
There exist functions gi(xi) such that fX(x1,…,xn) = ∏_{i=1}^n gi(xi)
Independence
A very useful theorem
Theorem
Let X1,…,Xn be mutually independent random variables and gi(xi) a function of xi only. Then the random variables g1(X1),…,gn(Xn) are mutually independent
Let X1,…,Xn be mutually independent random variables and gi(xi) functions. Then IE[g1(X1)·…·gn(Xn)] = ∏_{i=1}^n IE[gi(Xi)]
Application: MGF of sums
Theorem
Let X1,…,Xn be mutually independent random variables, with mgfs MX1(t),…,MXn(t). Define the random variable Z := X1+…+Xn. The mgf of Z satisfies MZ(t) = ∏_{i=1}^n MXi(t)
Let X1,…,Xn be mutually independent random variables with normal distribution Xi∼N(μi,σi²). Define Z := X1+…+Xn and μ := μ1+…+μn, σ² := σ1²+…+σn². Then Z is normally distributed with Z∼N(μ,σ²)
Example – Sum of independent Normals
Proof of Theorem
We have seen in Lecture 1 that Xi∼N(μi,σi²) ⟹ MXi(t) = exp(μi t + t²σi²/2)
As X1,…,Xn are mutually independent, from the Theorem in Slide 47, we get MZ(t) = ∏_{i=1}^n MXi(t) = ∏_{i=1}^n exp(μi t + t²σi²/2) = exp( (μ1+…+μn)t + t²(σ1²+…+σn²)/2 ) = exp( μt + t²σ²/2 )
Example – Sum of independent Normals
Proof of Theorem
Therefore Z has moment generating function MZ(t) = exp( μt + t²σ²/2 )
The above is the mgf of a normal distribution with mean μ and variance σ²
Since mgfs characterize distributions (see Theorem in Slide 71 of Lecture 1), we conclude Z∼N(μ,σ2)
Example – Sum of independent Gammas
Theorem
Let X1,…,Xn be mutually independent random variables with Gamma distribution Xi∼Γ(αi,β). Define Z:=X1+…+Xn and α:=α1+…+αn Then Z has Gamma distribution Z∼Γ(α,β)
Example – Sum of independent Gammas
Proof of Theorem
We have seen in Lecture 1 that Xi∼Γ(αi,β) ⟹ MXi(t) = β^αi / (β−t)^αi
As X1,…,Xn are mutually independent, from the Theorem in Slide 47, we get MZ(t) = ∏_{i=1}^n MXi(t) = ∏_{i=1}^n β^αi / (β−t)^αi = β^(α1+…+αn) / (β−t)^(α1+…+αn) = β^α / (β−t)^α
Example – Sum of independent Gammas
Proof of Theorem
Therefore Z has moment generating function MZ(t) = β^α / (β−t)^α
The above is the mgf of a Gamma distribution with parameters α and β
Since mgfs characterize distributions (see Theorem in Slide 71 of Lecture 1), we conclude Z∼Γ(α,β)
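A quick simulation consistent with this result (a sketch assuming NumPy; note that NumPy's gamma sampler is parametrized by shape and scale = 1/β, whereas the slides use the rate β; the values of αi and β are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
alphas = [1.0, 2.5, 0.5]   # shape parameters alpha_i (arbitrary choices)
beta = 2.0                  # common rate parameter
n = 10**6

# Sum of independent Gamma(alpha_i, beta) samples; NumPy expects scale = 1/beta
Z = sum(rng.gamma(a, 1 / beta, n) for a in alphas)

alpha = sum(alphas)
# Gamma(alpha, beta) has mean alpha/beta and variance alpha/beta^2
print(Z.mean(), alpha / beta)      # ~2.0 vs 2.0
print(Z.var(), alpha / beta**2)    # ~1.0 vs 1.0
```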
Expectation of sums
Expectation is linear
Theorem
For random variables X1,…,Xn and scalars a1,…,an we have IE[a1X1+…+anXn]=a1IE[X1]+…+anIE[Xn]
Variance of sums
Variance is quadratic
Theorem
For random variables X1,…,Xn and scalars a1,…,an we have Var[a1X1+…+anXn] = a1²Var[X1]+…+an²Var[Xn] + 2 ∑_{i<j} ai aj Cov(Xi,Xj). If X1,…,Xn are mutually independent then Var[a1X1+…+anXn] = a1²Var[X1]+…+an²Var[Xn]
Part 3: Random samples
iid random variables
Definition
The random variables X1,…,Xn are independent identically distributed or iid with pdf or pmf f(x) if
X1,…,Xn are mutually independent
The marginal pdf or pmf of each Xi satisfies fXi(x)=f(x),∀x∈R
Random sample
Suppose the data in an experiment consists of observations on a population
Suppose the population has distribution f(x)
Each observation is labelled Xi
We always assume that the population is infinite
Therefore each Xi has distribution f(x)
We also assume the observations are independent
Definition
The random variables X1,…,Xn are a random sample of size n from the population f(x) if X1,…,Xn are iid with pdf or pmf f(x)
Random sample
Remark: Let X1,…,Xn be a random sample of size n from the population f(x). The joint distribution of X=(X1,…,Xn) is fX(x1,…,xn) = f(x1)·…·f(xn) = ∏_{i=1}^n f(xi) (since the Xi's are mutually independent with distribution f)
Definition
We call fX the joint sample distribution
Random sample
Notation:
When the population distribution f(x) depends on a parameter θ we write f=f(x∣θ)
In this case the joint sample distribution is fX(x1,…,xn∣θ) = ∏_{i=1}^n f(xi∣θ)
Example
Suppose a population has Exponential(β) distribution f(x∣β) = (1/β) e^(−x/β) for x>0
Suppose X1,…,Xn is a random sample from the population f(x∣β)
The joint sample distribution is then fX(x1,…,xn∣β) = ∏_{i=1}^n f(xi∣β) = ∏_{i=1}^n (1/β) e^(−xi/β) = (1/β^n) e^(−(x1+…+xn)/β)
Example
We have P(X1>2) = ∫_2^∞ f(x∣β) dx = ∫_2^∞ (1/β) e^(−x/β) dx = e^(−2/β)
Thanks to the iid assumption we can easily compute P(X1>2,…,Xn>2) = ∏_{i=1}^n P(Xi>2) = ∏_{i=1}^n P(X1>2) = P(X1>2)^n = e^(−2n/β)
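A Monte Carlo check of this computation (a sketch assuming NumPy; β, n and the number of simulated samples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
beta, n = 3.0, 5          # Exponential(beta) population, sample size n
n_sim = 10**6

# Each row is one random sample X_1, ..., X_n from the population
samples = rng.exponential(scale=beta, size=(n_sim, n))

# Fraction of samples in which every observation exceeds 2
print(np.mean(np.all(samples > 2, axis=1)))   # Monte Carlo estimate
print(np.exp(-2 * n / beta))                   # exact value e^{-2n/beta}
```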
Part 4: Unbiased estimators
Point estimation
Usual situation: Suppose a population has distribution f(x∣θ)
In general, the parameter θ is unknown
Suppose that knowing θ is sufficient to characterize f(x∣θ)
Example: A population could be normally distributed f(x∣μ,σ²) = (1/(√(2π) σ)) exp( −(x−μ)²/(2σ²) ), x∈R
Here μ is the mean and σ2 the variance
Knowing μ and σ2 completely characterizes the normal distribution
Point estimation
Goal: We want to make predictions about the population
In order to do that, we need to know the population distribution f(x∣θ)
It is therefore desirable to determine θ, with reasonable certainty
Definitions:
Point estimation is the procedure of estimating θ from a random sample
A point estimator is any function of a random sample W(X1,…,Xn)
Point estimators are also called statistics
Unbiased estimator
Definition
Suppose W is a point estimator of a parameter θ
The bias of W is the quantity Biasθ:=IE[W]−θ
W is an unbiased estimator if Biasθ=0, that is, IE[W]=θ
Note: A point estimator W=W(X1,…,Xn) is itself a random variable. Thus IE[W] is the mean of such random variable
Next goal
We want to estimate mean and variance of a population
Unbiased estimators for such quantities are:
Sample mean
Sample variance
Estimating the population mean
Problem
Suppose we have a population with distribution f(x∣θ). We want to estimate the population mean μ := ∫_R x f(x∣θ) dx
Sample mean
Definition
The sample mean of a random sample X1,…,Xn is the statistic W(X1,…,Xn) := X̄ := (1/n) ∑_{i=1}^n Xi
Sample mean
Sample mean is unbiased estimator of mean
Theorem
The sample mean X̄ is an unbiased estimator of the population mean μ, that is, IE[X̄]=μ
Sample mean
Proof of theorem
X1,…,Xn is a random sample from f(x∣θ)
Therefore Xi∼f(x∣θ) and IE[Xi]=∫Rxf(x∣θ)dx=μ
By linearity of expectation we have IE[X̄] = (1/n) ∑_{i=1}^n IE[Xi] = (1/n)·nμ = μ
This shows X̄ is an unbiased estimator of μ
Variance of Sample mean
For reasons that will become clear later, it is useful to compute the variance of the sample mean X̄
Lemma
X1,…,Xn random sample from a population with mean μ and variance σ². Then Var[X̄] = σ²/n
Variance of Sample mean
Proof of Lemma
By assumption, the population has mean μ and variance σ²
Since Xi is sampled from the population, we have IE[Xi]=μ, Var[Xi]=σ²
Since the variance is quadratic, and the Xi's are independent, Var[X̄] = Var[ (1/n) ∑_{i=1}^n Xi ] = (1/n²) ∑_{i=1}^n Var[Xi] = (1/n²)·nσ² = σ²/n
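Both facts, IE[X̄]=μ and Var[X̄]=σ²/n, can be illustrated by simulation (a sketch assuming NumPy; the normal population and the values of μ, σ, n are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n = 1.0, 2.0, 10    # population mean, sd and sample size
n_sim = 10**5

# Draw n_sim independent random samples of size n and compute each sample mean
samples = rng.normal(mu, sigma, size=(n_sim, n))
xbar = samples.mean(axis=1)

print(xbar.mean(), mu)             # ~1.0 : unbiasedness of the sample mean
print(xbar.var(), sigma**2 / n)    # ~0.4 : variance sigma^2 / n
```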
Estimating the population variance
Problem
Suppose we have a population f(x∣θ) with mean μ and variance σ². We want to estimate the population variance
Sample variance
Definition
The sample variance of a random sample X1,…,Xn is the statistic S² := (1/(n−1)) ∑_{i=1}^n (Xi − X̄)², where X̄ is the sample mean X̄ := (1/n) ∑_{i=1}^n Xi
Sample variance
Equivalent formulation
Proposition
It holds that S² := ( ∑_{i=1}^n (Xi − X̄)² )/(n−1) = ( ∑_{i=1}^n Xi² − nX̄² )/(n−1)
Sample variance
Proof of Proposition
We have ∑_{i=1}^n (Xi−X̄)² = ∑_{i=1}^n (Xi² + X̄² − 2XiX̄) = ∑_{i=1}^n Xi² + nX̄² − 2X̄ ∑_{i=1}^n Xi = ∑_{i=1}^n Xi² + nX̄² − 2nX̄² = ∑_{i=1}^n Xi² − nX̄²
Dividing by n−1 yields the desired identity S² = ( ∑_{i=1}^n Xi² − nX̄² )/(n−1)
Sample variance
Sample variance is unbiased estimator of variance
Theorem
The sample variance S2 is an unbiased estimator of the population variance σ2, that is, IE[S2]=σ2
Sample variance
Proof of theorem
By linearity of expectation we infer IE[(n−1)S²] = IE[ ∑_{i=1}^n Xi² − nX̄² ] = ∑_{i=1}^n IE[Xi²] − n·IE[X̄²]
Since Xi∼f(x∣θ), we have IE[Xi]=μ, Var[Xi]=σ²
Therefore, by definition of variance, we infer IE[Xi²] = Var[Xi] + IE[Xi]² = σ² + μ²
Sample variance
Proof of theorem
Also recall that IE[X̄]=μ, Var[X̄]=σ²/n
By definition of variance, we get IE[X̄²] = Var[X̄] + IE[X̄]² = σ²/n + μ²
Putting everything together, IE[(n−1)S²] = n(σ²+μ²) − n(σ²/n + μ²) = nσ² − σ² = (n−1)σ². Dividing both sides by (n−1) yields the claim IE[S²]=σ²
Additional note
The sample variance is defined by S² = ( ∑_{i=1}^n (Xi−X̄)² )/(n−1) = ( ∑_{i=1}^n Xi² − nX̄² )/(n−1)
Where does the n−1 factor in the denominator come from?
(It would look more natural to divide by n instead of by n−1)
The n−1 factor is caused by a loss of precision:
Ideally, the sample variance S² would be computed using the population mean μ
Since μ is not available, we estimate it with the sample mean X̄
This leads to the loss of 1 degree of freedom
Additional note
General statistical rule: Lose 1 degree of freedom for each parameter estimated
In the case of the sample variance S², we have to estimate one parameter (the population mean μ). Hence degrees of freedom = sample size − number of estimated parameters = n−1
This is where the n−1 factor comes from!
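The effect of dividing by n−1 rather than n can be seen numerically (a sketch assuming NumPy; same kind of simulated normal samples as above, with arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n = 0.0, 1.0, 5
n_sim = 10**5

samples = rng.normal(mu, sigma, size=(n_sim, n))

s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1 (sample variance S^2)
s2_biased = samples.var(axis=1, ddof=0)     # divide by n

print(s2_unbiased.mean())   # ~1.0            : IE[S^2] = sigma^2
print(s2_biased.mean())     # ~0.8 = (n-1)/n  : biased downwards
```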
Notation
The realization of a random sample X1,…,Xn is denoted by x1,…,xn
The realization of the sample mean X̄ is denoted x̄ := (1/n) ∑_{i=1}^n xi
The realization of the sample variance S² is denoted s² = ( ∑_{i=1}^n (xi−x̄)² )/(n−1) = ( ∑_{i=1}^n xi² − nx̄² )/(n−1)
Capital letters denote random variables, while lowercase letters denote specific values (realizations) of those variables
The chi-squared distribution is:
defined in terms of squares of N(0,1) random variables
designed to describe variance estimation
used to define other members of the normal family
Student t-distribution
F-distribution
Why the normal family is important
Classical hypothesis testing and regression problems
The same maths solves apparently unrelated problems
Easy to compute
Statistics tables
Software
Enables the development of approximate methods in more complex (and interesting) problems
Reminder: Normal distribution
X has normal distribution with mean μ and variance σ² if its pdf is f(x) := (1/√(2πσ²)) exp( −(x−μ)²/(2σ²) ), x∈R
In this case we write X∼N(μ,σ2)
The standard normal distribution is denoted N(0,1)
Chi-squared distribution
Definition
Definition
Let Z1,…,Zr be iid N(0,1) random variables. The chi-squared distribution with r degrees of freedom, denoted χ²_r, is the distribution of the sum Z1² + … + Zr², that is, χ²_r ∼ Z1² + … + Zr²
Chi-squared distribution
Pdf characterization
Theorem
The χ²_r distribution is equivalent to a Gamma distribution: χ²_r ∼ Γ(r/2, 1/2). Therefore the pdf of χ²_r can be written in closed form as f_{χ²_r}(x) = (1/(Γ(r/2) 2^(r/2))) x^(r/2 − 1) e^(−x/2), x>0
Chi-squared distribution
Plots of chi-squared pdf for different choices of r
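A small simulation consistent with the identification χ²_r ∼ Γ(r/2, 1/2) (a sketch assuming NumPy; it only compares the first two moments, which for χ²_r are r and 2r; r and the simulation size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
r = 4              # degrees of freedom (arbitrary choice)
n_sim = 10**6

# chi-squared(r) as a sum of r squared independent N(0,1) variables
W = (rng.normal(0, 1, size=(n_sim, r)) ** 2).sum(axis=1)

# Gamma(r/2, 1/2) has mean (r/2)/(1/2) = r and variance (r/2)/(1/2)^2 = 2r
print(W.mean(), r)      # ~4.0
print(W.var(), 2 * r)   # ~8.0
```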
Proof of Theorem – Case r=1
We start with the case r=1
We need to prove that χ²_1 ∼ Γ(1/2, 1/2)
Therefore we need to show that the pdf of χ²_1 is f_{χ²_1}(x) = (1/(Γ(1/2) 2^(1/2))) x^(−1/2) e^(−x/2), x>0
Proof of Theorem – Case r=1
To this end, notice that by definition χ²_1 ∼ Z², with Z∼N(0,1)
Hence, for x>0 we can compute the cdf via F_{χ²_1}(x) = P(χ²_1 ≤ x) = P(Z² ≤ x) = P(−√x ≤ Z ≤ √x) = 2P(0 ≤ Z ≤ √x), where in the last equality we used the symmetry of Z around 0
Proof of Theorem – Case r=1
Recalling the definition of the standard normal pdf we get F_{χ²_1}(x) = 2P(0 ≤ Z ≤ √x) = 2·(1/√(2π)) ∫_0^√x e^(−t²/2) dt = (2/√(2π)) G(√x), where we set G(y) := ∫_0^y e^(−t²/2) dt
Proof of Theorem – Case r=1
We can now compute the pdf of χ²_1 by differentiating the cdf
By the Fundamental Theorem of Calculus we have G′(y) = e^(−y²/2), so that G′(√x) = e^(−x/2)
Therefore, by the chain rule, f_{χ²_1}(x) = d/dx F_{χ²_1}(x) = (2/√(2π))·G′(√x)·1/(2√x) = (1/√(2π)) x^(−1/2) e^(−x/2) = (1/(Γ(1/2) 2^(1/2))) x^(−1/2) e^(−x/2), where we used Γ(1/2)=√π. This is the claimed pdf, so χ²_1 ∼ Γ(1/2,1/2)
Theorem
Let X1,…,Xn be a random sample from N(μ,σ²), with sample mean X̄ and sample variance S². Then:
X̄ and S² are independent random variables
X̄ ∼ N(μ, σ²/n)
(n−1)S²/σ² ∼ χ²_{n−1}
The proof uses the following Lemma.
Lemma
Let X and Y be jointly normal random variables. Then X and Y are independent ⟺ Cov(X,Y)=0
Properties of Sample Mean and Variance
Proof of Theorem
Note that Xi−X̄ and X̄ are jointly normal, being linear combinations of the iid normals X1,…,Xn
Therefore, we can apply the Lemma to Xi−X̄ and X̄
To this end, recall that Var[X̄] = σ²/n
Also note that, by independence of X1,…,Xn, Cov(Xi,Xj) = Var[Xi] if i=j, and Cov(Xi,Xj) = 0 if i≠j
Properties of Sample Mean and Variance
Proof of Theorem
Using bilinearity of covariance (i.e. linearity in both arguments): Cov(Xi−X̄, X̄) = Cov(Xi, X̄) − Cov(X̄, X̄) = (1/n) ∑_{j=1}^n Cov(Xi,Xj) − Var[X̄] = (1/n) Var[Xi] − Var[X̄] = σ²/n − σ²/n = 0
By the Lemma, we infer independence of Xi−X̄ and X̄
Properties of Sample Mean and Variance
Proof of Theorem
We have shown that Xi−X̄ and X̄ are independent
By the Theorem in Slide 46, we hence have that (Xi−X̄)² and X̄ are independent
By the same Theorem we also get that ∑_{i=1}^n (Xi−X̄)² = (n−1)S² and X̄ are independent
Again by the same Theorem, this finally implies independence of S² and X̄
Properties of Sample Mean and Variance
Proof of Theorem
We now want to show that X̄ ∼ N(μ, σ²/n)
We are assuming that X1,…,Xn are iid with IE[Xi]=μ, Var[Xi]=σ²
We have already seen in Slides 70 and 72 that, in this case, IE[X̄]=μ, Var[X̄]=σ²/n
A sum of independent normals is normal (see the Theorem in Slide 50)
Therefore X̄ is normal, with mean μ and variance σ²/n
Properties of Sample Mean and Variance
Proof of Theorem
We are left to prove that (n−1)S²/σ² ∼ χ²_{n−1}
This is somewhat technical and we don’t actually prove it
The following is only a heuristic for the result:
When replacing μ with X̄ we lose 1 degree of freedom, which is why (n−1)S²/σ² ∼ χ²_{n−1} has n−1 (rather than n) degrees of freedom
Part 7: t-distribution
Estimating the Mean
Problem
Estimate the mean μ of a normal population
What to do?
We can collect normal samples X1,…,Xn with Xi∼N(μ,σ2)
We then compute the sample mean X̄ := (1/n) ∑_{i=1}^n Xi
We know that IE[X̄]=μ
X̄ approximates μ
Question
How good is this approximation? How to quantify it?
Answer: We consider the Test Statistic T := (X̄−μ)/(σ/√n) ∼ N(0,1)
This is because X̄ ∼ N(μ, σ²/n) – see Slide 101
If σ is known, then the only unknown in T is μ
T can be used to estimate μ⟹ Hypothesis Testing
Hypothesis testing
Suppose that μ=μ0 (this is called the null hypothesis)
Using the data collected x=(x1,…,xn), we compute t := (x̄−μ0)/(σ/√n), where x̄ = (1/n) ∑_{i=1}^n xi
When μ=μ0, the number t is a realization of the test statistic (random variable) T = (X̄−μ0)/(σ/√n) ∼ N(0,1)
Therefore, we can compute the probability that T is close to t: p := P(T≈t)
Hypothesis testing
Given the value p:=P(T≈t) we have 2 cases:
p is small ⟹ reject the null hypothesis μ=μ0
p small means it is unlikely to observe such a value of t
Recall that t depends only on the data x, and on our guess μ0
We conclude that our guess must be wrong ⟹ μ≠μ0
p is large ⟹ do not reject the null hypothesis μ=μ0
p large means that t occurs with reasonably high probability
There is no reason to believe our guess μ0 was wrong
But we also do not have sufficient reason to believe μ0 was correct
Important Remark
The key step in Hypothesis Testing is computing p=P(T≈t)
This is only possible if we know the distribution of T = (X̄−μ)/(σ/√n)
If we assume that the variance σ2 is known, then T∼N(0,1) and p is easily computed
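In the known-variance case the computation of p is a few lines of code. A minimal sketch (assuming NumPy and SciPy; the data, μ0 and σ below are made up, and P(T≈t) is interpreted as the usual two-sided p-value P(∣T∣ ≥ ∣t∣)):

```python
import numpy as np
from scipy.stats import norm

x = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9])   # hypothetical observed data
mu0 = 5.0                                        # null hypothesis value of mu
sigma = 0.3                                      # assumed known population sd

n = len(x)
t = (x.mean() - mu0) / (sigma / np.sqrt(n))      # realized test statistic

# Two-sided p-value under T ~ N(0,1)
p = 2 * (1 - norm.cdf(abs(t)))
print(t, p)
```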
Unknown variance
Problem
In general, the population variance σ2 is unknown. What to do?
Idea: We can replace σ² with the sample variance S² = ( ∑_{i=1}^n Xi² − nX̄² )/(n−1). The new test statistic is hence T := (X̄−μ)/(S/√n)
Distribution of the test statistic
Question
What is the distribution of
T := (X̄−μ)/(S/√n) ?
Answer: T has t-distribution with n−1 degrees of freedom
This is also known as Student’s t-distribution
Student was the pen name under which W.S. Gosset was publishing his research
He was head brewer at Guinness, at the time the largest brewery in the world!
He used the t-distribution to study chemical properties of barley from small samples [2] (see the original paper)
t-distribution
Definition
A random variable T has Student’s t-distribution with p degrees of freedom, denoted by T∼tp, if the pdf of T is fT(t) = ( Γ((p+1)/2) / Γ(p/2) ) · 1/(pπ)^(1/2) · 1/(1+t²/p)^((p+1)/2), t∈R
Characterization of the t-distribution
Theorem
Let U∼N(0,1) and V∼χ²_p be independent random variables. Then T := U/√(V/p) ∼ tp, that is, T has t-distribution with p degrees of freedom.
Proof: Given as exercise in Homework assignments
Distribution of t-statistic
As a consequence of the Theorem in previous slide we obtain:
Theorem
Let X1,…,Xn be a random sample from N(μ,σ²). Then the random variable T = (X̄−μ)/(S/√n) has t-distribution with n−1 degrees of freedom, that is, T ∼ t_{n−1}
Distribution of t-statistic
Proof of Theorem
Since X1,…,Xn is a random sample from N(μ,σ²), we have that (see Slide 101) X̄ ∼ N(μ, σ²/n)
Therefore, we can renormalize and obtain U := (X̄−μ)/(σ/√n) ∼ N(0,1)
Distribution of t-statistic
Proof of Theorem
We have also shown that V := (n−1)S²/σ² ∼ χ²_{n−1}
Finally, we can rewrite T as T = (X̄−μ)/(S/√n) = U/√(V/(n−1))
By the Theorem in Slide 118, we conclude that T∼tn−1
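A simulation consistent with this theorem (a sketch assuming NumPy and SciPy; it compares a few empirical quantiles of simulated t-statistics with the corresponding quantiles of t_{n−1}; all parameters are arbitrary):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(10)
mu, sigma, n = 0.0, 2.0, 8
n_sim = 10**5

samples = rng.normal(mu, sigma, size=(n_sim, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)          # sample standard deviation S
T = (xbar - mu) / (s / np.sqrt(n))       # t-statistic for each simulated sample

qs = [0.025, 0.5, 0.975]
print(np.quantile(T, qs))                # empirical quantiles
print(t_dist.ppf(qs, df=n - 1))          # quantiles of t with n-1 degrees of freedom
```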
Properties of t-distribution
Proposition: Expectation and Variance of t-distribution
Suppose that T∼tp. We have:
If p>1 then IE[T]=0
If p>2 then Var[T] = p/(p−2)
Notes:
We have to assume p>1: for p=1 the expectation IE[T] does not exist
We have to assume p>2: for p=2 we have Var[T]=∞, and for p=1 the variance is not defined
IE[T]=0 follows trivially from symmetry of the pdf fT(t) around t=0
Computing Var[T] is quite involved, and we skip it
t-distribution
Comparison with Standard Normal
The tp distribution approximates the standard normal N(0,1):
tp is symmetric around zero and bell-shaped, like N(0,1)
tp has heavier tails compared to N(0,1)
While the variance of N(0,1) is 1, the variance of tp is p/(p−2) > 1
We have that tp → N(0,1) as p→∞
Plot: Comparison with Standard Normal
References
[1]
Casella, George, Berger, Roger L., Statistical inference, second edition, Brooks/Cole, 2002.
[2]
Gosset (Student), W.S., The probable error of a mean, Biometrika. 6 (1908) 1–25.