Display system information and load tidyverse and faraway packages
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.4.1 fastmap_1.2.0 cli_3.6.4
[5] tools_4.4.1 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.49 jsonlite_1.9.1 xfun_0.51
[13] digest_0.6.37 rlang_1.1.5 evaluate_1.0.3
library(tidyverse)
library(faraway)
1 Introduction
The pulp data set comes from an experiment that measured paper brightness for four different shift operators.
pulp <- as_tibble(pulp) %>% print(n = Inf)
# A tibble: 20 × 2
bright operator
<dbl> <fct>
1 59.8 a
2 60 a
3 60.8 a
4 60.8 a
5 59.8 a
6 59.8 b
7 60.2 b
8 60.4 b
9 59.9 b
10 60 b
11 60.7 c
12 60.7 c
13 60.5 c
14 60.9 c
15 60.3 c
16 61 d
17 60.8 d
18 60.6 d
19 60.5 d
20 60.5 d
Graphical summary:
pulp %>%
  ggplot(mapping = aes(x = operator, y = bright)) +
  geom_jitter(width = 0.1, height = 0.0) +
  labs(x = "Operator", y = "Bright")
The research question is whether there is any difference in the paper brightness produced by the different shift operators.
We can answer this question using the classical ANOVA (analysis of variance) model \[
y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i = 1,\ldots,4, j = 1,\ldots,5,
\] where \(\epsilon_{ij}\) are assumed to be iid \(N(0,\sigma^2)\).
options(contrasts = c("contr.sum", "contr.poly"))
lmod <- lm(bright ~ operator, data = pulp)
summary(lmod)
Call:
lm(formula = bright ~ operator, data = pulp)
Residuals:
Min 1Q Median 3Q Max
-0.440 -0.195 -0.070 0.175 0.560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.40000 0.07289 828.681 <2e-16 ***
operator1 -0.16000 0.12624 -1.267 0.223
operator2 -0.34000 0.12624 -2.693 0.016 *
operator3 0.22000 0.12624 1.743 0.101
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.326 on 16 degrees of freedom
Multiple R-squared: 0.4408, Adjusted R-squared: 0.3359
F-statistic: 4.204 on 3 and 16 DF, p-value: 0.02261
The contrast contr.sum forces the coefficients for the 4 operators to sum to 0 (when using the dummy/one-hot coding). So the coefficient for the 4th operator is \(-(-0.16 - 0.34 + 0.22) = 0.28\).
The factor operator is significant with p-value 0.023.
anova(lmod, test ="F")
Analysis of Variance Table
Response: bright
Df Sum Sq Mean Sq F value Pr(>F)
operator 3 1.34 0.44667 4.2039 0.02261 *
Residuals 16 1.70 0.10625
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The hypothesis being tested in the ANOVA is whether these four particular operators produce the same mean paper brightness. This is called a fixed effects model because the effects \(\alpha_i\) are assumed to be fixed.
Suppose instead the research question is: based on this data set, are the operators in the population all the same? We then arrive at a random effects model \[
y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i = 1,\ldots,4, j = 1,\ldots,5,
\] where we assume
\(\alpha_i\) are iid from \(N(0, \sigma_{\alpha}^2)\),
\(\epsilon_{ij}\) are iid \(N(0,\sigma_{\epsilon}^2)\), and
\(\alpha_i\) and \(\epsilon_{ij}\) are jointly independent.
This is called a random effects model because now the effects \(\alpha_i\) are assumed to be random. The parameters in this random effects model are \(\mu\), \(\sigma_\alpha^2\), and \(\sigma_\epsilon^2\). The latter two are also called the variance component parameters. To test the research hypothesis, we would test \(H_0: \sigma_\alpha^2=0\).
Differences between a fixed effects model and a random effects model:
the interpretation is different,
a random effects model can be more parsimonious (fewer parameters) than the corresponding fixed effects model, and
observations in a random effects model are correlated.
In this chapter, we study estimation and inference for mixed effects models.
2 Mixed effects model
Traditionally, linear models can be divided into three categories:
Fixed effects model: \(\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \(\boldsymbol{\beta}\) is fixed.
Random effects model: \(\mathbf{y} = \mathbf{Z} \boldsymbol{\gamma} + \boldsymbol{\epsilon}\), where \(\boldsymbol{\gamma}\) is random.
Mixed effects model or mixed model: \(\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\gamma} + \boldsymbol{\epsilon}\), where \(\boldsymbol{\beta}\) is fixed and \(\boldsymbol{\gamma}\) is random.
In a mixed effects model \[\begin{eqnarray*}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\gamma} + \boldsymbol{\epsilon}
\end{eqnarray*}\] where
\(\mathbf{X} \in \mathbb{R}^{n \times p}\) is a design matrix for fixed effects \(\boldsymbol{\beta} \in \mathbb{R}^p\),
\(\mathbf{Z} \in \mathbb{R}^{n \times q}\) is a design matrix for random effects \(\boldsymbol{\gamma} \in \mathbb{R}^q\),
The most common assumption is \(\boldsymbol{\epsilon} \sim N(\mathbf{0}_n, \sigma^2 \mathbf{I})\), \(\boldsymbol{\gamma} \sim N(\mathbf{0}_q, \boldsymbol{\Sigma})\), and \(\boldsymbol{\epsilon}\) is independent of \(\boldsymbol{\gamma}\).
The primary goals of the mixed model (aka variance components model) are
estimation and testing of the fixed effects \(\boldsymbol{\beta}\),
estimation and testing of the variance component parameters, and
prediction.
3 Example: one-way ANOVA
For the one-way ANOVA random effects model with \(a\) levels and \(n_i\) observations in level \(i\), \[
y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i=1,\ldots,a, \; j = 1,\ldots,n_i,
\] we recognize it as a mixed effects model \[\begin{eqnarray*}
\mathbf{y} = \mathbf{1}_{\sum_i n_i} \mu + \begin{pmatrix}
\mathbf{1}_{n_1} & & \\
& \vdots & \\
& & \mathbf{1}_{n_a}
\end{pmatrix} \boldsymbol{\gamma} + \boldsymbol{\epsilon},
\end{eqnarray*}\] where \(\boldsymbol{\gamma} = (\alpha_1, \ldots, \alpha_a)^T \sim N(\mathbf{0}_a, \sigma_{\alpha}^2 \mathbf{I}_a)\) and \(\boldsymbol{\epsilon} \sim N(\mathbf{0}_{\sum_i n_i}, \sigma_{\epsilon}^2 \mathbf{I})\) are independent. Note in \(\mathbf{Z}\) we have one column for each level.
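This \(\mathbf{Z}\) can be constructed explicitly in R; a small sketch for the pulp layout (\(a = 4\) operators, \(n_i = 5\); the variable names are illustrative):

```r
# Build the random-effects design matrix Z for the one-way layout:
# one indicator column per operator level.
operator <- factor(rep(letters[1:4], each = 5))
Z <- model.matrix(~ operator - 1) # 20 x 4 indicator matrix
dim(Z)      # 20 rows (observations), 4 columns (levels)
colSums(Z)  # 5 observations per level
```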
How do we estimate parameters \(\mu\), \(\sigma_{\alpha}^2\), and \(\sigma_{\epsilon}^2\)?
4 Estimation
4.1 ANOVA estimator
Consider the one-way ANOVA case and assume a balanced design: \(n_1 = \cdots = n_a = n\). That is, each level has the same number of observations \(n\); for example, \(n = 5\) in the pulp example.
Because \(\mathbb{E}\, Y_{ij} = \mu\), we can estimate \(\mu\) by the average of the \(y_{ij}\):
mean(pulp$bright)
[1] 60.4
To estimate the variance component parameters \(\sigma_\alpha^2\) and \(\sigma_\epsilon^2\), the familiar ANOVA table gives the partition \[\begin{eqnarray*}
\text{SST} &=& \text{SSE} + \text{SSA} \\
\sum_{i=1}^a \sum_{j=1}^n (y_{ij} - \bar{y}_{\cdot \cdot})^2 &=& \sum_{i=1}^a \sum_{j=1}^n (y_{ij} - \bar{y}_{i \cdot})^2 + \sum_{i=1}^a \sum_{j=1}^n (\bar{y}_{i \cdot} - \bar{y}_{\cdot \cdot})^2.
\end{eqnarray*}\] Now we have (show in HW5) \[\begin{eqnarray*}
\mathbb{E} (\text{SSE}) &=& a(n-1) \sigma_{\epsilon}^2 \\
\mathbb{E} (\text{SSA}) &=& (a-1)(n \sigma_{\alpha}^2 + \sigma_{\epsilon}^2),
\end{eqnarray*}\] which can be solved to obtain estimators \[\begin{eqnarray*}
\widehat{\sigma}_{\epsilon}^2 &=& \frac{\text{SSE}}{a(n-1)} = \text{MSE}, \\
\widehat{\sigma}_{\alpha}^2 &=& \frac{\text{SSA}/(a-1) - \widehat{\sigma}_{\epsilon}^2}{n} = \frac{\text{MSA} - \text{MSE}}{n}.
\end{eqnarray*}\]
For the pulp data, the ANOVA table is
(aovmod <- aov(bright ~ operator, data = pulp) %>% summary())
Df Sum Sq Mean Sq F value Pr(>F)
operator 3 1.34 0.4467 4.204 0.0226 *
Residuals 16 1.70 0.1062
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
str(aovmod)
List of 1
$ :Classes 'anova' and 'data.frame': 2 obs. of 5 variables:
..$ Df : num [1:2] 3 16
..$ Sum Sq : num [1:2] 1.34 1.7
..$ Mean Sq: num [1:2] 0.447 0.106
..$ F value: num [1:2] 4.2 NA
..$ Pr(>F) : num [1:2] 0.0226 NA
- attr(*, "class")= chr [1:2] "summary.aov" "listof"
When MSA < MSE, we obtain a negative \(\widehat{\sigma}_{\alpha}^2\).
The ANOVA approach is hard to generalize to unbalanced ANOVA or more complicated designs.
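As a check, the ANOVA estimators can be computed by hand from the raw pulp measurements (hardcoded below so the snippet is self-contained):

```r
# ANOVA estimators for the balanced one-way layout:
# sigma_eps^2-hat = MSE, sigma_alpha^2-hat = (MSA - MSE) / n
y <- matrix(c(59.8, 60.0, 60.8, 60.8, 59.8,   # operator a
              59.8, 60.2, 60.4, 59.9, 60.0,   # operator b
              60.7, 60.7, 60.5, 60.9, 60.3,   # operator c
              61.0, 60.8, 60.6, 60.5, 60.5),  # operator d
            nrow = 5)
a <- ncol(y); n <- nrow(y)
ybar_i <- colMeans(y)                  # level means
sse <- sum(sweep(y, 2, ybar_i)^2)      # within-level sum of squares: 1.70
ssa <- n * sum((ybar_i - mean(y))^2)   # between-level sum of squares: 1.34
mse <- sse / (a * (n - 1))             # 0.10625
msa <- ssa / (a - 1)                   # 0.44667
(sigma2_alpha <- (msa - mse) / n)      # about 0.0681
```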
4.2 Maximum likelihood estimation (MLE)
For the mixed effects model \[\begin{eqnarray*}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\gamma} + \boldsymbol{\epsilon},
\end{eqnarray*}\] where \(\boldsymbol{\epsilon} \sim N(\mathbf{0}_n, \sigma^2 \mathbf{I})\) and \(\boldsymbol{\gamma} \sim N(\mathbf{0}_q, \boldsymbol{\Sigma})\) are independent of each other, we have \[
\mathbf{Y} \sim N(\mathbf{X} \boldsymbol{\beta}, \mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^T + \sigma^2 \mathbf{I}).
\] So the likelihood is \[
\frac{1}{(2\pi)^{n/2} \det(\mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^T + \sigma^2 \mathbf{I})^{1/2}} e^{- \frac 12 (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^T (\mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^T + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})}
\] and the log-likelihood is \[
\ell(\boldsymbol{\beta}, \boldsymbol{\Sigma}, \sigma^2) = - \frac n2 \log(2\pi) - \frac 12 \log \det (\mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^T + \sigma^2 \mathbf{I}) - \frac 12 (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^T (\mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^T + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}).
\]
We can maximize the log-likelihood function to obtain MLE.
Exercise (HW5): Derive the MLE for the balanced one-way ANOVA example.
One drawback of MLE is that it produces biased estimates of the variance component parameters \(\sigma^2\) and \(\boldsymbol{\Sigma}\). E.g., in the linear regression case (fixed effects model), the MLE of \(\sigma^2\) is \[
\widehat{\sigma}_{\text{MLE}}^2 = \frac{\text{RSS}}{n},
\] whereas an unbiased estimator is \[
\widehat{\sigma}^2 = \frac{\text{RSS}}{n-p},
\] where \(p\) is the number of parameters.
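To see this bias numerically, here is a small simulation sketch (illustrative only; not part of the pulp analysis):

```r
# Simulate linear regressions with true sigma^2 = 1 and compare the
# average of the MLE RSS/n with the unbiased RSS/(n - p).
set.seed(1)
n <- 20; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
est <- replicate(2000, {
  y <- X %*% rnorm(p) + rnorm(n)        # true sigma^2 = 1
  rss <- sum(lm.fit(X, y)$residuals^2)
  c(mle = rss / n, unbiased = rss / (n - p))
})
rowMeans(est) # MLE averages near (n - p)/n = 0.8; unbiased near 1
```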
Let’s find the MLE for the pulp data using the lme4 package. The syntax (1 | operator) specifies that the data are grouped by operator, and the 1 indicates that the random effect is constant within each group. The default method is REML, so we set REML = FALSE to compute the MLE instead.
library(lme4)
smod <- lmer(bright ~ 1 + (1 | operator), data = pulp, REML = FALSE)
summary(smod)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bright ~ 1 + (1 | operator)
Data: pulp
AIC BIC logLik -2*log(L) df.resid
22.5 25.5 -8.3 16.5 17
Scaled residuals:
Min 1Q Median 3Q Max
-1.50554 -0.78116 -0.06353 0.65850 1.56232
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.04575 0.2139
Residual 0.10625 0.3260
Number of obs: 20, groups: operator, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1294 466.7
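As a check of the log-likelihood formula above, we can plug the reported MLE (\(\widehat{\mu} = 60.4\), \(\widehat{\sigma}_{\alpha}^2 = 0.04575\), \(\widehat{\sigma}_{\epsilon}^2 = 0.10625\)) into \(\ell\) directly (a self-contained sketch with the pulp data hardcoded):

```r
# Evaluate the marginal log-likelihood at the MLE from the lmer output
y <- c(59.8, 60.0, 60.8, 60.8, 59.8, 59.8, 60.2, 60.4, 59.9, 60.0,
       60.7, 60.7, 60.5, 60.9, 60.3, 61.0, 60.8, 60.6, 60.5, 60.5)
Z <- kronecker(diag(4), matrix(1, 5, 1))     # one indicator column per operator
mu <- 60.4; s2a <- 0.04575; s2e <- 0.10625   # MLE from the summary above
V <- s2a * Z %*% t(Z) + s2e * diag(20)       # Var(y) = Z Sigma Z^T + sigma^2 I
r <- y - mu
ld <- as.numeric(determinant(V)$modulus)     # log det of V
ll <- -0.5 * (20 * log(2 * pi) + ld + sum(r * solve(V, r)))
ll  # about -8.26, matching logLik in the summary above
```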
4.3 Restricted MLE (REML)
The restricted maximum likelihood (REML) method tries to reduce the bias in variance component estimates.
Assume \(\mathbf{X}\) has full column rank \(p\). Let \(\mathbf{K} \in \mathbb{R}^{n \times (n-p)}\) be a basis of the space \({\cal N}(\mathbf{X}^T)\), which is orthogonal to \({\cal C}(\mathbf{X})\). Then \[
\mathbf{K}^T \mathbf{Y} \sim N(\mathbf{0}_{n-p}, \mathbf{K}^T (\mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^T + \sigma^2 \mathbf{I}) \mathbf{K}).
\] We estimate the variance component parameters \((\sigma^2, \boldsymbol{\Sigma})\) by MLE based on the transformed data \((\mathbf{K}^T \mathbf{y}, \mathbf{K}^T \mathbf{Z})\), and then estimate the fixed effects \(\boldsymbol{\beta}\) by generalized least squares. It can be shown that the REML estimate does not depend on the choice of \(\mathbf{K}\).
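One concrete choice of \(\mathbf{K}\) (illustrative; the REML estimate does not depend on it) takes the last \(n - p\) columns of the full QR decomposition of \(\mathbf{X}\). A sketch for the pulp model, where \(\mathbf{X} = \mathbf{1}_{20}\):

```r
# Columns of K form an orthonormal basis of N(X^T), orthogonal to C(X).
n <- 20; p <- 1
X <- matrix(1, n, p)                # intercept-only design
Q <- qr.Q(qr(X), complete = TRUE)   # full n x n orthogonal factor
K <- Q[, (p + 1):n, drop = FALSE]   # n x (n - p)
max(abs(crossprod(K, X)))           # essentially 0
```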
Let’s find the REML for the pulp data using lme4.
library(lme4)
mmod <- lmer(bright ~ 1 + (1 | operator), data = pulp, REML = TRUE)
summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: bright ~ 1 + (1 | operator)
Data: pulp
REML criterion at convergence: 18.6
Scaled residuals:
Min 1Q Median 3Q Max
-1.4666 -0.7595 -0.1244 0.6281 1.6012
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.06808 0.2609
Residual 0.10625 0.3260
Number of obs: 20, groups: operator, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1494 404.2
We find that the REML estimate is exactly the same as the ANOVA estimate. Exercise (HW5): show this.
5 Inference
5.1 LRT and adjusted F test for fixed effects
If we compare two nested models that differ only in their fixed effects, we can use the standard likelihood ratio test (LRT). Remember that we have to use the MLE (not REML) for the LRT.
F tests for fixed effects need adjusted degrees of freedom (the Kenward-Roger adjusted F test for REML fits), as implemented in the pbkrtest package.
5.2 Parametric bootstrap
Testing the variance component parameters, e.g., \(H_0: \sigma_{\alpha}^2 = 0\), can be hard because of the boundary condition (the estimator has to be nonnegative), so the conventional \(\chi^2\) null distribution for the LRT can be wrong. E.g., for the pulp data, the LRT using the conventional \(\chi_1^2\) null distribution gives a non-significant p-value.
# fishy result using LRT
pchisq(lrtstat, 1, lower.tail = FALSE)
[1] 0.1090199
The idea of the parametric bootstrap is to generate new \(y\) from the fitted null model many times. For each simulation replicate, we calculate the LRT statistic. The p-value is then estimated by the proportion of simulation replicates whose LRT statistic exceeds the observed value \(2.5684\).
B <- 1000
lrstat <- numeric(B)
set.seed(123) # for reproducibility
for (i in 1:B) {
  by <- unlist(simulate(nullmod))
  bnull <- lm(by ~ 1)
  # balt <- refitML(smod, by)
  balt <- suppressMessages(lmer(by ~ 1 + (1 | operator), pulp, REML = FALSE))
  lrstat[i] <- as.numeric(2 * (logLik(balt, REML = FALSE) - logLik(bnull)))
}
Advantages of the parametric bootstrap:
It is not restricted to the MLE; it applies to many other estimators.
It does not rely on the dubious asymptotic \(\chi^2\) distribution.
It can be used to generate confidence intervals.
It applies to inference on both \(\boldsymbol{\beta}\) and the variance component parameters.
Drawback of the parametric bootstrap: it is computationally intensive.
5.3 Exact LRT and RLRT test
Another simulation-based method is provided by the RLRsim package.
library(RLRsim)
exactLRT(smod, nullmod)
simulated finite sample distribution of LRT. (p-value based on 10000
simulated values)
data:
LRT = 2.5684, p-value = 0.0241
exactRLRT(mmod)
simulated finite sample distribution of RLRT.
(p-value based on 10000 simulated values)
data:
RLRT = 3.4701, p-value = 0.0224
6 Estimate random effects
Consider the mixed effects model \[
\begin{eqnarray*}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\gamma} + \boldsymbol{\epsilon},
\end{eqnarray*}
\] where \(\boldsymbol{\epsilon} \sim N(\mathbf{0}_n, \sigma^2 \mathbf{I})\) and \(\boldsymbol{\gamma} \sim N(\mathbf{0}_q, \boldsymbol{\Sigma})\) are independent of each other. The random effects \(\boldsymbol{\gamma}\) are random variables. It does not make sense to estimate \(\boldsymbol{\gamma}\).
However, if we take the Bayesian point of view, we can estimate it by its posterior mean. We have the likelihood \[
\mathbf{y} \mid \boldsymbol{\gamma} \sim N(\mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\gamma}, \sigma^2 \mathbf{I}_n)
\] and a prior distribution \[
\boldsymbol{\gamma} \sim N(\mathbf{0}_q, \boldsymbol{\Sigma}).
\] Then by the Bayes theorem, the posterior distribution is \[\begin{eqnarray*}
f(\boldsymbol{\gamma} \mid \mathbf{y}) &=& \frac{f(\mathbf{y} \mid \boldsymbol{\gamma}) \times f(\boldsymbol{\gamma})}{f(\mathbf{y})} \\
&=& ...
\end{eqnarray*}\] is a multivariate normal with mean (show in HW5) \[
\mathbb{E} (\boldsymbol{\gamma} \mid \mathbf{y}) = \boldsymbol{\Sigma} \mathbf{Z}^T (\mathbf{Z} \boldsymbol{\Sigma} \mathbf{Z}^T + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}).
\]
We can use this posterior mean to estimate the random effects. For the pulp data, the estimates of the random effects are obtained by
ranef(mmod)$operator
(Intercept)
a -0.1219403
b -0.2591231
c 0.1676679
d 0.2133955
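As a sanity check, the posterior mean formula can be evaluated directly for the pulp data, plugging in the REML variance estimates from the earlier output (values hardcoded so the snippet is self-contained):

```r
# E(gamma | y) = Sigma Z^T (Z Sigma Z^T + sigma^2 I)^{-1} (y - X beta-hat)
y <- c(59.8, 60.0, 60.8, 60.8, 59.8,   # operator a
       59.8, 60.2, 60.4, 59.9, 60.0,   # operator b
       60.7, 60.7, 60.5, 60.9, 60.3,   # operator c
       61.0, 60.8, 60.6, 60.5, 60.5)   # operator d
Z <- kronecker(diag(4), matrix(1, 5, 1))       # one indicator column per level
Sigma <- 0.06808 * diag(4)                     # REML sigma_alpha^2-hat
V <- Z %*% Sigma %*% t(Z) + 0.10625 * diag(20) # REML sigma_eps^2-hat
gamma_hat <- Sigma %*% t(Z) %*% solve(V, y - mean(y))
round(drop(gamma_hat), 4) # -0.1219 -0.2591 0.1677 0.2134, matching ranef(mmod)
```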
Compare these to the coefficients from fixed effects model:
(cc <- model.tables(aov(bright ~ operator, data = pulp)))
Tables of effects
operator
operator
a b c d
-0.16 -0.34 0.22 0.28
We find that the estimated random effects are uniformly smaller in magnitude than the corresponding fixed effect estimates:
cc[[1]]$operator / ranef(mmod)$operator
(Intercept)
a 1.312117
b 1.312117
c 1.312117
d 1.312117
That’s why Bayesian estimates are often called shrinkage estimators. Exercise: prove this in HW5 for the balanced one-way ANOVA example.
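Numerically, the constant ratio above agrees with the balanced-case shrinkage factor \(n \widehat{\sigma}_{\alpha}^2 / (n \widehat{\sigma}_{\alpha}^2 + \widehat{\sigma}_{\epsilon}^2)\); a quick check with the REML estimates (a numerical illustration, not the HW5 proof):

```r
n <- 5                                          # observations per operator
sigma2_alpha <- 0.06808; sigma2_eps <- 0.10625  # REML estimates
shrink <- n * sigma2_alpha / (n * sigma2_alpha + sigma2_eps)
c(shrink, 1 / shrink)  # about 0.7621 and 1.3121, matching the ratio above
```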
7 Prediction
For the observed levels (operators we already know), the best linear unbiased predictors (BLUPs) are obtained by
fixef(mmod) + ranef(mmod)$operator
(Intercept)
a 60.27806
b 60.14088
c 60.56767
d 60.61340
or
predict(mmod, newdata = data.frame(operator = "a"))
1
60.27806
For a new operator, the best prediction we can make is the fixed-effect intercept.
Diagnostic plots for random effects models usually use the residuals calculated from the predicted random effects. These residuals are regarded as estimates of \(\epsilon\).
In the penicillin data set, we want to study how yield depends on treat (the production process) and blend (the corn steep liquor blend). It is natural to treat treat as a fixed effect and blend as a blocking variable (random effect).
penicillin
treat blend yield
1 A Blend1 89
2 B Blend1 88
3 C Blend1 97
4 D Blend1 94
5 A Blend2 84
6 B Blend2 77
7 C Blend2 92
8 D Blend2 79
9 A Blend3 81
10 B Blend3 87
11 C Blend3 87
12 D Blend3 85
13 A Blend4 87
14 B Blend4 92
15 C Blend4 89
16 D Blend4 84
17 A Blend5 79
18 B Blend5 81
19 C Blend5 80
20 D Blend5 88
Graphical summary
penicillin %>%
  ggplot() +
  geom_point(mapping = aes(x = treat, y = yield, color = blend))
penicillin %>%
  ggplot() +
  geom_point(mapping = aes(x = blend, y = yield, color = treat))
Classical two-way ANOVA:
op <- options(contrasts = c("contr.sum", "contr.poly"))
lmod <- aov(yield ~ blend + treat, data = penicillin)
summary(lmod)
Df Sum Sq Mean Sq F value Pr(>F)
blend 4 264 66.00 3.504 0.0407 *
treat 3 70 23.33 1.239 0.3387
Residuals 12 226 18.83
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# p-value estimated by parametric bootstrap
(pval <- mean(lrstat > 4.0474))
[1] 0.352
with standard error
sqrt(pval * (1 - pval) / B)
[1] 0.01510285
The pbkrtest package automates this parametric bootstrap procedure (PBtest), along with the LRT and other tests for fixed effects. The LRT and parametric bootstrap results are similar to what we obtained. Note the F statistic here is the LRT statistic divided by its degrees of freedom, which is assumed to be F-distributed.