Therefore we can estimate \mu_X - \mu_Y with the sample means, that is,
\text{Estimate} = \overline{X} - \overline{Y}
The two-sample t-statistic
Since we are testing for difference in mean, we have
\text{Hypothesised value} = \mu_X - \mu_Y
The Estimated Standard Error (e.s.e.) is the standard deviation of the estimator
\text{e.s.e.} = \text{Standard Deviation of } \overline{X} -\overline{Y}
The two-sample t-statistic
Therefore the two-sample t-statistic is
T = \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\text{e.s.e.}}
Under the Null Hypothesis that \mu_X = \mu_Y, the t-statistic becomes
T = \frac{\overline{X} - \overline{Y} }{\text{e.s.e.}}
A note on the degrees of freedom (df)
The general rule is
\text{df} = \text{Sample size} - \text{No. of estimated parameters}
Sample size in two-sample t-test:
n in the first sample
m in the second sample
Hence total number of observations is n + m
No. of estimated parameters is 2: Namely \mu_X and \mu_Y
Hence the degrees of freedom in the two-sample t-test are
{\rm df} = n + m - 2
(more on this later)
The estimated standard error
Recall: We are assuming populations have same variance
\sigma^2_X = \sigma^2_Y = \sigma^2
We need to compute the estimated standard error
\text{e.s.e.} = \text{Standard Deviation of } \ \overline{X} -\overline{Y}
Variance of sample mean was computed in the Lemma in Slide 72 Lecture 2
Since X_1, \ldots, X_n are iid N(\mu_X,\sigma^2) and Y_1, \ldots, Y_m are iid N(\mu_Y,\sigma^2), by the Lemma we get
{\rm Var}[\overline{X}] = \frac{\sigma^2}{n} \,, \qquad \quad
{\rm Var}[\overline{Y}] = \frac{\sigma^2}{m}
The estimated standard error
Since X_i and Y_j are independent we get {\rm Cov}(X_i,Y_j)=0
By bilinearity of covariance we infer
{\rm Cov}( \overline{X} , \overline{Y} ) = \frac{1}{n \cdot m} \sum_{i=1}^n \sum_{j=1}^m {\rm Cov}(X_i,Y_j) = 0
Therefore
{\rm Var}[\overline{X} - \overline{Y}] = {\rm Var}[\overline{X}] + {\rm Var}[\overline{Y}] = \sigma^2 \left( \frac{1}{n} + \frac{1}{m} \right)
From Lecture 2, we know that S_X^2 and S_Y^2 are unbiased estimators of \sigma^2, i.e.
{\rm I\kern-.3em E}[ S_X^2 ] = {\rm I\kern-.3em E}[ S_Y^2 ] = \sigma^2
Therefore, both S_X^2 and S_Y^2 can be used to estimate \sigma^2
Estimating the variance
We can improve the estimate of \sigma^2 by combining S_X^2 and S_Y^2
We will consider a (convex) linear combination
S^2 := \lambda_X S_X^2 + \lambda_Y S_Y^2 \,, \qquad
\lambda_X + \lambda_Y = 1
S^2 is still an unbiased estimator of \sigma^2, since \begin{align*}
{\rm I\kern-.3em E}[S^2] & = {\rm I\kern-.3em E}[ \lambda_X S_X^2 + \lambda_Y S_Y^2 ] \\
& = \lambda_X {\rm I\kern-.3em E}[S_X^2] + \lambda_Y {\rm I\kern-.3em E}[S_Y^2] \\
& = (\lambda_X + \lambda_Y) \sigma^2 \\
& = \sigma^2
\end{align*}
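As a quick sanity check, unbiasedness can also be verified by simulation. A minimal sketch (the weights, sample sizes, seed and \sigma^2 = 4 are arbitrary choices, not from the slides):
# Monte Carlo check that a convex combination of S_X^2, S_Y^2 is unbiased
set.seed(1)
n <- 10; m <- 13; sigma2 <- 4
lambda_x <- 0.3; lambda_y <- 0.7      # any weights summing to 1 work
S2 <- replicate(10000, {
  x <- rnorm(n, sd = sqrt(sigma2))
  y <- rnorm(m, sd = sqrt(sigma2))
  lambda_x * var(x) + lambda_y * var(y)
})
mean(S2)   # close to sigma2 = 4, consistent with unbiasedness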
Estimating the variance
We choose coefficients \lambda_X and \lambda_Y which reflect sample sizes
\lambda_X := \frac{n - 1}{n + m - 2} \qquad
\qquad
\lambda_Y := \frac{m - 1}{n + m - 2}
Notes:
We have \lambda_X + \lambda_Y = 1
Denominators in \lambda_X and \lambda_Y are degrees of freedom {\rm df } = n + m - 2
This choice is made so that (n + m - 2) S^2 / \sigma^2 has a chi-squared distribution (more on this later)
Pooled estimator of variance
Definition
The pooled estimator of \sigma^2 is defined as
S_p^2 := \lambda_X S_X^2 + \lambda_Y S_Y^2
= \frac{(n-1) S_X^2 + (m-1) S_Y^2}{n + m - 2}
Note:
n=m implies \lambda_X = \lambda_Y
In this case S_X^2 and S_Y^2 have same weight in S_p^2
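In R, the pooled estimate is easy to compute by hand, since var() returns the unbiased sample variance. A minimal sketch with two hypothetical samples:
x <- c(4.1, 5.2, 3.9, 4.8)        # hypothetical sample, n = 4
y <- c(5.0, 4.4, 4.9, 5.3, 4.7)   # hypothetical sample, m = 5
n <- length(x); m <- length(y)
sp2 <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)
cat("Pooled variance estimate:", sp2)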
The two-sample t-statistic
So far the t-statistic has the form
T = \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sigma \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
We replace \sigma with the pooled estimator S_p
The two-sample t-statistic
Definition
The two-sample t-statistic is defined as
T := \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{ S_p \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
Note: Under the Null Hypothesis that \mu_X = \mu_Y this becomes
T = \frac{\overline{X} - \overline{Y}}{ S_p \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
= \frac{\overline{X} - \overline{Y}}{ \sqrt{ \dfrac{ (n-1) S_X^2 + (m-1) S_Y^2 }{n + m - 2} } \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}
Distribution of two-sample t-statistic
Theorem
The two-sample t-statistic has the t_{n+m-2} distribution
T := \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{ S_p \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}} \sim t_{n + m - 2}
Distribution of two-sample t-statistic
Proof
We have already seen that \overline{X} - \overline{Y} is normal with
{\rm I\kern-.3em E}[\overline{X} - \overline{Y}] = \mu_X - \mu_Y
\qquad \qquad
{\rm Var}[\overline{X} - \overline{Y}] = \sigma^2 \left( \frac{1}{n} + \frac{1}{m} \right)
Therefore we can rescale \overline{X} - \overline{Y} to get
U := \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{ \sigma \sqrt{ \dfrac{1}{n} + \dfrac{1}{m}}} \sim N(0,1)
Distribution of two-sample t-statistic
Proof
We are assuming X_1, \ldots, X_n are iid N(\mu_X,\sigma^2)
Therefore, as already shown, we have
\frac{ (n-1) S_X^2 }{ \sigma^2 } \sim \chi_{n-1}^2
Similarly, since Y_1, \ldots, Y_m are iid N(\mu_Y,\sigma^2), we get
\frac{ (m-1) S_Y^2 }{ \sigma^2 } \sim \chi_{m-1}^2
Distribution of two-sample t-statistic
Proof
Since X_i and Y_j are independent, we also have that
\frac{ (n-1) S_X^2 }{ \sigma^2 } \quad \text{ and } \quad
\frac{ (m-1) S_Y^2 }{ \sigma^2 } \quad \text{ are independent}
Hence their sum satisfies
V := \frac{(n-1) S_X^2 + (m-1) S_Y^2}{\sigma^2} = \frac{(n+m-2) \, S_p^2}{\sigma^2} \sim \chi^2_{n+m-2}
Moreover U and V are independent (sample means are independent of sample variances), so by definition of the t-distribution
T = \frac{U}{\sqrt{V/(n+m-2)}} = \frac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{ S_p \ \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}} \sim t_{n + m - 2}
This concludes the proof
Interpretation: Reject H_0 when either
p < 0.05 \qquad \text{ or } \qquad t \in \,\,\text{Rejection Region}
\qquad \qquad \qquad \qquad
(T \, \sim \, t_{n+m-2})

Alternative            Rejection Region     t^*                     p-value
\mu_X \neq \mu_Y       |t| > t^*            t_{n+m-2}(0.025)        2P(T > |t|)
\mu_X < \mu_Y          t < - t^*            t_{n+m-2}(0.05)         P(T < t)
\mu_X > \mu_Y          t > t^*              t_{n+m-2}(0.05)         P(T > t)

Reject H_0 if the t-statistic t falls in the Rejection Region. Here T \sim t_{n+m-2}
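In R, the critical values t_{n+m-2}(\alpha) can be obtained with the quantile function qt(); a small sketch (the sample sizes n and m are hypothetical):
n <- 10; m <- 13                 # hypothetical sample sizes
qt(1 - 0.025, df = n + m - 2)    # t* for the two-sided alternative
qt(1 - 0.05, df = n + m - 2)     # t* for the one-sided alternatives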
The two-sample t-test in R
General commands
Store the samples x_1,\ldots,x_n and y_1,\ldots,y_m in two R vectors
x <- c(x1, ..., xn)
y <- c(y1, ..., ym)
Perform a two-sample t-test on x and y
Alternative           R command
\mu_X \neq \mu_Y      t.test(x, y, var.equal = T)
\mu_X < \mu_Y         t.test(x, y, var.equal = T, alt = "less")
\mu_X > \mu_Y         t.test(x, y, var.equal = T, alt = "greater")
Read output: similar to one-sample t-test
The main quantity of interest is p-value
Comments on command t.test(x, y)
mu = mu0 tells R to test null hypothesis:
H_0 \colon \mu_X - \mu_Y = \mu_0 \qquad \quad (\text{default is } \, \mu_0 = 0)
var.equal = T tells R to assume that populations have same variance
\sigma_X^2 = \sigma^2_Y
In this case R computes the t-statistic with the formula discussed earlier
t = \frac{ \overline{x} - \overline{y} }{s_p \sqrt{ \dfrac{1}{n} + \dfrac{1}{m} }}
Comments on command t.test(x, y)
Warning: If var.equal = T is not specified then
R assumes that the populations have different variances \sigma_X^2 \neq \sigma_Y^2
In this case the t-statistic
t = \frac{ \overline{x} - \overline{y} }{s_p \sqrt{ \dfrac{1}{n} + \dfrac{1}{m} }}
is NOT t-distributed
R performs the Welch t-test instead of the classic t-test
(more on this later)
Part 3: Two-sample t-test Example
Mathematicians    x_1   x_2   x_3   x_4   x_5   x_6   x_7   x_8   x_9   x_{10}
Wages             36    40    46    54    57    58    59    60    62    63

Accountants       y_1   y_2   y_3   y_4   y_5   y_6   y_7   y_8   y_9   y_{10}   y_{11}   y_{12}   y_{13}
Wages             37    37    42    44    46    48    54    56    59    60       60       64       64
Samples: Wage data on 10 Mathematicians and 13 Accountants
Wages are independent and normally distributed
Populations have equal variance
Question: Is there evidence of differences in average pay?
Answer: Two-sample two-sided t-test for the hypothesis
H_0 \colon \mu_X = \mu_Y \,,\qquad
H_1 \colon \mu_X \neq \mu_Y
Degrees of freedom are {\rm df} = n + m - 2 = 10 + 13 - 2 = 21
Find corresponding critical value in Table 1
t_{21}(0.025) = 2.08
(Note the value 0.025, since this is a two-sided test)
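As a check, both the critical value and the t-statistic can be computed in R; a sketch using the wage data from the tables above:
x <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)               # mathematicians
y <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)   # accountants
n <- length(x); m <- length(y)
sp <- sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2))  # pooled s_p
t_stat <- (mean(x) - mean(y)) / (sp * sqrt(1/n + 1/m))           # t-statistic
t_star <- qt(1 - 0.025, df = n + m - 2)                          # t_21(0.025)
cat("t =", t_stat, " critical value =", t_star)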
Completing the t-test
Interpretation:
We have that
| t | = 0.464 < 2.08 = t_{21}(0.025)
t falls in the acceptance region
Therefore the p-value satisfies p>0.05
There is no evidence (p>0.05) in favor of H_1
Hence we accept that \mu_X = \mu_Y
Conclusion: Average pay levels seem to be the same for both professions
The two-sample t-test in R
This is a two-sided t-test with assumption of equal variance. The p-value is
p = 2 P(t_{n+m-2} > |t|) \,, \qquad t = \frac{\bar{x} - \bar{y} }{s_p \ \sqrt{1/n + 1/m}}
# Enter Wages data in 2 vectors using function c()
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)

# Two-sample t-test with null hypothesis mu_X = mu_Y
# and equal variance assumption. Store result in answer and print.
answer <- t.test(mathematicians, accountants, var.equal = TRUE)
print(answer)
Two Sample t-test
data: mathematicians and accountants
t = 0.46359, df = 21, p-value = 0.6477
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.569496 10.338727
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
First line: R tells us that a Two-Sample t-test is performed
Second line: Data for t-test is mathematicians and accountants
Third line:
The t-statistic computed is t = 0.46359
Note: This coincides with the one computed by hand!
There are 21 degrees of freedom
The p-value is p = 0.6477
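This p-value can be reproduced by hand from the t cumulative distribution function pt() (a one-line check):
2 * pt(0.46359, df = 21, lower.tail = FALSE)   # two-sided p-value, approx. 0.6477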
Fourth line: The alternative hypothesis is that the difference in means is not zero
This translates to H_1 \colon \mu_X \neq \mu_Y
Warning: This is not saying to reject H_0 – R is just stating H_1
Fifth line: R computes a 95 \% confidence interval for \mu_X - \mu_Y
(\mu_X - \mu_Y) \in [-6.569496, 10.338727]
Interpretation: If you repeat the experiment (on new data) over and over, the interval [a, b] will contain \mu_X - \mu_Y about 95\% of the time
Seventh line: R computes sample mean for the two populations
Sample mean for mathematicians is 53.5
Sample mean for accountants is 51.61538
Conclusion: The p-value is p = 0.6477
Since p > 0.05 we do not reject H_0
Hence \mu_X and \mu_Y appear to be similar
Average pay levels seem to be the same for both professions
Comment on Assumptions
The previous two-sample t-test was conducted under the following assumptions:
Wages data is normally distributed
The two populations have equal variance
Using R, we can plot the data to see if these are reasonable (graphical exploration)
Warnings: Even if the assumptions hold
we cannot expect the samples to be exactly normal (bell-shaped)
rather, look for approximate normality
we cannot expect the sample variances to match
rather, look for a similar spread in the data
Estimating the sample distribution in R
Suppose we are given a data sample stored in a vector z
If the sample is large, we can check normality by plotting the histogram of z
Example: z is a sample of size 1000 from N(0,1) – Its histogram resembles N(0,1)
z <- rnorm(1000)             # Sample 1000 times from N(0,1)
hist(z, probability = TRUE)  # Plot histogram, with area scaled to 1
Drawback: Small samples \implies hard to check normality from histogram
This is true even if the data is normal
Example: z below is a sample of size 9 from N(0,1) – But its histogram does not look normal
# Random sample from N(0,1)
z <- c(-0.78, -1.67, -0.38, 0.92, -0.58, 0.61, -1.62, -0.06, 0.52)
hist(z, probability = TRUE)  # Histogram does not look normal
Solution: Suppose we are given an iid sample z from a distribution f
The command density(z) estimates the population density f
(The estimate is based on the sample and smoothing – not an easy task)
Example: z as in previous slide. The plot of density(z) shows normal behavior
# Random sample from N(0,1)
z <- c(-0.78, -1.67, -0.38, 0.92, -0.58, 0.61, -1.62, -0.06, 0.52)
dz <- density(z)  # Estimate the density of z
plot(dz)          # Plot the estimated density
The R object density(z) models a 1D function (the estimated distribution of z)
As such, it contains a grid of x values, with associated y values
x values are stored in vector density(z)$x
y values are stored in vector density(z)$y
These values are useful to set the axis range in a plot
dz <- density(z)
plot(dz,                  # Plot dz
     xlim = range(dz$x),  # Set x-axis range
     ylim = range(dz$y))  # Set y-axis range
Axes range set as the min and max values of components of dz
Checking the Assumptions on our Example
# Compute the estimated distributions
d.math <- density(mathematicians)
d.acc <- density(accountants)

# Plot the estimated distributions
plot(d.math,                                      # Plot d.math
     xlim = range(c(d.math$x, d.acc$x)),          # Set x-axis range
     ylim = range(c(d.math$y, d.acc$y)),          # Set y-axis range
     main = "Estimated Distributions of Wages")   # Add title to plot
lines(d.acc,    # Layer plot of d.acc
      lty = 2)  # Use different line style
legend("topleft",                     # Add legend at top-left
       legend = c("Mathematicians",   # Labels for legend
                  "Accountants"),
       lty = c(1, 2))                 # Assign curves to legend
Axes range set as the min and max values of components of d.math and d.acc
Wages data looks approximately normally distributed (roughly bell-shaped)
The two populations have similar variance (spreads look similar)
Conclusion: Two-sample t-test with equal variance is appropriate \implies accept H_0
Part 4: The Welch t-test
Samples with different variance
We just examined the two-sample t-test
This assumes independent normal populations with equal variance
\sigma_X^2 = \sigma_Y^2
Question: What happens if variances are different?
Answer: Use the Welch Two-sample t-test
This is a generalization of the two-sample t-test to the case \sigma_X^2 \neq \sigma_Y^2
In R it is performed with t.test(x, y)
Note that we are just omitting the option var.equal = TRUE
Equivalently, you may specify var.equal = FALSE
The Welch two-sample t-test
The Welch t-test consists of computing the Welch statistic
w = \frac{\overline{x} - \overline{y}}{ \sqrt{ \dfrac{s_X^2}{n} + \dfrac{s_Y^2}{m} } }
If sample sizes m,n > 5, then w is approximately t-distributed
Degrees of freedom are not integers, and depend on S_X, S_Y, n, m
If variances are similar, the Welch statistic is comparable to the t-statistic
w \approx t : = \frac{ \overline{x} - \overline{y} }{s_p \sqrt{ \dfrac{1}{n} + \dfrac{1}{m} }}
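For reference, the non-integer df comes from the Welch–Satterthwaite approximation, which is what R's t.test uses. A sketch computing w and the df by hand for the wage data:
x <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)               # mathematicians
y <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)   # accountants
n <- length(x); m <- length(y)
vx <- var(x) / n
vy <- var(y) / m
w  <- (mean(x) - mean(y)) / sqrt(vx + vy)                # Welch statistic
df <- (vx + vy)^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))    # Welch-Satterthwaite df
cat("w =", w, " df =", df)   # should match the t.test output shown later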
Welch t-test vs two-sample t-test
If variances are similar:
Welch statistic and t-statistic are similar
p-value from Welch t-test is similar to p-value from two-sample t-test
Since p-values are similar, the two tests yield the same decision most of the time
The tests can be used interchangeably
If variances are very different:
Welch statistic and t-statistic are different
p-values from the two tests can differ a lot
The two tests might give different decisions
It is wrong to apply the two-sample t-test, as variances are different
The Welch two-sample t-test in R
# Enter Wages data
mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)

# Perform Welch two-sample t-test with null hypothesis mu_X = mu_Y
# Store result of t.test in answer
answer <- t.test(mathematicians, accountants)

# Print answer
print(answer)
Note:
This is almost the same code as in Slide 86
Only difference: we are omitting the option var.equal = TRUE in t.test
Welch Two Sample t-test
data: mathematicians and accountants
t = 0.46546, df = 19.795, p-value = 0.6467
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.566879 10.336109
sample estimates:
mean of x mean of y
53.50000 51.61538
Comments on output:
First line: R tells us that a Welch Two-Sample t-test is performed
The rest of the output is similar to that of the classic t-test
The main difference is that the t-statistic and p-value differ from the classic t-test
Third line:
The Welch t-statistic is w = 0.46546 (standard t-test gave t = 0.46359)
Degrees of freedom are fractional: {\rm df} = 19.795 (standard t-test: {\rm df} = 21)
The Welch statistic is approximately t-distributed: W \approx t_{19.795}
Fifth line: The confidence interval for \mu_X - \mu_Y is also different
Conclusion: The p-values obtained with the two tests are almost the same
# Perform two-sample t-test and retrieve p-value
ans <- t.test(trA, trB, alt = "less", var.equal = T)
ans$p.value
[1] 0.05836482
The p-value is p > 0.05
H_0 cannot be rejected \implies There is no evidence that Treatment A is better
This conclusion is wrong, because the two-sample t-test does not apply here
Disclaimer
The previous data was synthetic, and the background story was made up!
Nonetheless, the example is still valid
To construct the data, I sampled as follows
Treatment A: Sample of size 6 from N(-2,1)
Treatment B: Sample of size 12 from N(-1.5,9)
We see that
\mu_A < \mu_B \,, \qquad
\sigma_A^2 \neq \sigma_B^2
This tells us that:
We can expect that some samples will support that \mu_A < \mu_B
Two-sample t-test is inappropriate because \sigma_A^2 \neq \sigma_B^2
Generating the Data
The code I used is shown below
# Set seed for random generation
# This way you always get the same random numbers when
# you run this code
set.seed(21)

repeat {
  # Generate random samples
  x <- rnorm(6, mean = -2, sd = 1)
  y <- rnorm(12, mean = -1.5, sd = 3)

  # Round x and y to 1 decimal point
  x <- round(x, 1)
  y <- round(y, 1)

  # Perform one-sided t-tests for alternative hypothesis mu_x < mu_y
  ans_welch <- t.test(x, y, alt = "less", var.equal = F)
  ans_t_test <- t.test(x, y, alt = "less", var.equal = T)

  # Check that Welch test succeeds and two-sample test fails
  if (ans_welch$p.value < 0.05 && ans_t_test$p.value > 0.05) {
    cat("Data successfully generated!!!\n\n")
    cat("Synthetic Data TrA:", x, "\n")
    cat("Synthetic Data TrB:", y, "\n\n")
    cat("Welch t-test p-value:", ans_welch$p.value, "\n")
    cat("Two-sample t-test p-value:", ans_t_test$p.value)
    break
  }
}
Logical vectors are extremely useful to evaluate conditions
Example:
given a numerical vector x
we want to count how many entries are above a value t
# Generate a vector containing sequence 1 to 8
x <- seq(from = 1, to = 8, by = 1)
# Generate vector of flags for entries strictly above 5
y <- (x > 5)
cat("Vector x is: (", x, ")")
cat("Entries above 5 are: (", y, ")")
Example: sample 1000 entries from N(0,1) and count how many are above 0
Since there are many (1000) entries, we expect a result close to 500
This is because the proportion of entries above 0 converges to P(Z > 0) = 1/2 for Z \sim N(0,1)
Question: How to do this?
Hint: T/F are interpreted as 1/0 in arithmetic operations
T + T
[1] 2
T + F
[1] 1
F + F
[1] 0
F + T + 3
[1] 4
Logical vectors – Application
The function sum(x) sums the entries of a vector x
We can use sum(x) to count the number of T entries in a logical vector x
x <- rnorm(1000)       # Generates vector with 1000 normal entries
y <- (x > 0)           # Generates logical vector of entries above 0
above_zero <- sum(y)   # Counts entries above zero
cat("Number of entries which are above the average 0 is", above_zero)
cat("This is pretty close to 500!")
Number of entries which are above the average 0 is 527
This is pretty close to 500!
Missing values
In practical data analysis, a data point is frequently unavailable
Statistical software needs ways to deal with this
R allows vectors to contain a special NA value - Not Available
NA is carried through in computations: operations on NA yield NA as the result
2*NA
[1] NA
NA+NA
[1] NA
T + NA
[1] NA
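For this reason, many base R functions accept the argument na.rm = TRUE to ignore missing values, and is.na() flags them. A brief illustration:
x <- c(1, 2, NA, 4)
sum(x)                 # NA: the missing value propagates
sum(x, na.rm = TRUE)   # 7: the missing value is ignored
is.na(x)               # FALSE FALSE TRUE FALSE - flags missing entries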
Indexing vectors
Components of a vector can be retrieved by indexing
# Create a vector x
x <- c(11, 22, 33, 44, 55, 66, 77, 88, 99, 100)
# Print vector x
cat("Vector x is:", x)
# Delete 2nd, 3rd and 7th entries of x
x <- x[ -c(2, 3, 7) ]
# Print x again
cat("Vector x with 2nd, 3rd and 7th entries removed:", x)
Vector x is: 11 22 33 44 55 66 77 88 99 100
Vector x with 2nd, 3rd and 7th entries removed: 11 44 55 66 88 99 100
Logical Subsetting
You can index or slice vectors by entering explicit indices
You can also index vectors, or subset, by using logical flag vectors:
Element is extracted if corresponding entry in the flag vector is TRUE
Logical flag vectors should be the same length as the vector to subset
Code: Suppose we are given a vector x
Create a flag vector by using
flag <- condition(x)
condition() is any function which returns a T/F vector of the same length as x
Subset x by using
x[flag]
Logical Subsetting
Example
The following code extracts negative components from a numeric vector
This can be done by using
x[ x < 0 ]
# Create numeric vector x
x <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)
# Get negative components from x and store them in neg_x
neg_x <- x[ x < 0 ]
cat("Vector x is:", x)
cat("Negative components of x are:", neg_x)
Vector x is: 5 -2.3 4 4 4 6 8 10 40221 -8
Negative components of x are: -2.3 -8
Logical Subsetting
Example
The following code extracts components falling between a and b
This can be done by using the logical AND operator &
x[ (x > a) & (x < b) ]
# Create numeric vector
x <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)
# Get components between 0 and 100
range_x <- x[ (x > 0) & (x < 100) ]
cat("Vector x is:", x)
cat("Components of x between 0 and 100 are:", range_x)
Vector x is: 5 -2.3 4 4 4 6 8 10 40221 -8
Components of x between 0 and 100 are: 5 4 4 4 6 8 10
The function which()
which() converts a logical vector flag into a numeric vector of indices
which(flag) is the vector of positions at which flag is TRUE
# Create a logical flag vector
flag <- c(T, F, F, T, F)
# Indices at which flag is TRUE
true_flag <- which(flag)
cat("Flag vector is:", flag)
cat("Positions for which Flag is TRUE are:", true_flag)
Flag vector is: TRUE FALSE FALSE TRUE FALSE
Positions for which Flag is TRUE are: 1 4
The function which() – Application
which() can be used to delete certain entries from a vector x
Create a flag vector by using
flag <- condition(x)
condition() is any function which returns a T/F vector of the same length as x
Delete entries flagged by condition using the code
x[ -which(flag) ]
The function which() – Application
Example
# Create numeric vector x
x <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)
# Print x
cat("Vector x is:", x)
# Flag positive components of x
flag_pos_x <- (x > 0)
# Remove positive components from x
x <- x[ -which(flag_pos_x) ]
# Print x again
cat("Vector x with positive components removed:", x)
Vector x is: 5 -2.3 4 4 4 6 8 10 40221 -8
Vector x with positive components removed: -2.3 -8
Functions that create vectors
The main functions to generate vectors are
c() concatenate
seq() sequence
rep() replicate
We have already met c() and seq() but there are more details to discuss
Concatenate
Recall: c() generates a vector containing the input values
# Generate a vector of values 1, 2, 3, 4, 5
x <- c(1, 2, 3, 4, 5)
# Print the vector
print(x)
[1] 1 2 3 4 5
Concatenate
c() can also concatenate vectors
This way you can add entries to an existing vector
# Create 2 vectors
x <- c(1, 2, 3, 4, 5)
y <- c(6, 7, 8)
# Concatenate vectors x and y, and also add element 9
z <- c(x, y, 9)
# Print the resulting vector
print(z)
[1] 1 2 3 4 5 6 7 8 9
Concatenate
You can assign names to vector elements
This modifies the way the vector is printed
# We specify a vector with 3 named entries
x <- c(first = "Red", second = "Green", third = "Blue")
# Print the named vector
print(x)
first second third
"Red" "Green" "Blue"
Concatenate
Given a named vector x
Names can be extracted with names(x)
Values can be extracted with unname(x)
# Create named vector
x <- c(first = "Red", second = "Green", third = "Blue")
# Access names of x via names(x)
names_x <- names(x)
# Access values of x via unname(x)
values_x <- unname(x)
cat("Names of x are:", names(x))
cat("Values of x are:", unname(x))
Names of x are: first second third
Values of x are: Red Green Blue
Concatenate
All elements of a vector have the same type
Concatenating vectors of different types leads to conversion
c(FALSE, 2) # Converts FALSE to 0
[1] 0 2
c(pi, "stats") # Converts pi to string
[1] "3.14159265358979" "stats"
c(TRUE, "stats") # Converts TRUE to string
[1] "TRUE" "stats"
Sequence
Recall the syntax of seq is
seq(from =, to =, by =, length.out =)
Omitting the third argument assumes that by = 1
# The following generates a vector of integers from 1 to 6
x <- seq(1, 6)
print(x)
[1] 1 2 3 4 5 6
Sequence
seq(x1, x2) is equivalent to x1:x2
Syntax x1:x2 is preferred to seq(x1, x2)
# Generate two vectors of integers from 1 to 6
x <- seq(1, 6)
y <- 1:6
cat("Vector x is:", x)
cat("Vector y is:", y)
cat("They are the same!")
Vector x is: 1 2 3 4 5 6
Vector y is: 1 2 3 4 5 6
They are the same!
Replicate
rep generates repeated values from a vector:
x vector
n integer
rep(x, n) repeats n times the vector x
# Create a vector with 3 components
x <- c(2, 1, 3)
# Repeat the vector x 4 times
y <- rep(x, 4)
cat("Original vector is:", x)
cat("Original vector repeated 4 times:", y)
The second argument of rep() can also be a vector:
Given x and y vectors
rep(x, y) repeats entries of x as many times as corresponding entries of y
x <- c(2, 1, 3)   # Vector to replicate
y <- c(1, 2, 3)   # Vector saying how to replicate
z <- rep(x, y)    # 1st entry of x is replicated 1 time
                  # 2nd entry of x is replicated 2 times
                  # 3rd entry of x is replicated 3 times
cat("Original vector is:", x)
cat("Original vector repeated is:", z)
Original vector is: 2 1 3
Original vector repeated is: 2 1 1 3 3 3
Replicate
rep() can be useful to create vectors of labels
Example: Suppose we want to collect some numeric data on 3 Cats and 4 Dogs
x <-c("Cat", "Dog") # Vector to replicatey <-rep(x, c(3, 4)) # 1st entry of x is replicated 3 times# 2nd entry of x is replicated 4 timescat("Vector of labels is:", y)