faraway package contains the datasets in the ELMR book.
1 O-ring example
In January 1986, the space shuttle Challenger exploded 73 seconds after launch.
The culprit is the O-ring seals in the rocket boosters. At lower temperatures, rubber becomes more brittle and is a less effective sealant. At the time of the launch, the temperature was 31°F.
Could the failure of the O-rings have been predicted?
Data: 23 previous shuttle missions. Each shuttle had 2 boosters, each with 3 O-rings. We know the number of O-rings out of 6 showing some damage and the launch temperature.
There seems a pattern: lower temperature, more damages.
oring_plot <- orings %>%mutate(failrate = damage /6) %>%ggplot(mapping =aes(x = temp, y = failrate)) +geom_point() +labs(x ="Temperature (F)", y ="Proportion of damages", title ="O-ring damage data")plot(oring_plot)
3 Binomial model
We model \(Y_i\) as a Binomial random variable with batch size \(m_i\) and “success” probability \(p_i\)\[
\mathbb{P}(Y_i = y_i) = \binom{m_i}{y_i} p_i^{y_i} (1 - p_i)^{m_i - y_i}.
\]
The parameter \(p_i\) is linked to the predictors \(X_1, \ldots, X_{q}\) via an inverse link function\[
p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}},
\] where \(\eta_i\) is the linear predictor or systematic component\[
\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{q} x_{iq} = \mathbf{x}_i^T \boldsymbol{\beta}
\]
The Bernoulli model in ELMR 2 is a special case with all batch sizes \(m_i = 1\).
Conversely, the Binomial model is equivalent to a Bernoulli model with \(\sum_i m_i\) observations, or a Bernoulli model with observation weights \((y_i, m_i - y_i)\).
4 Logistic regression
For Binomial model, we input \((\text{successes}_i, \text{failures}_i)\) as responses.
lmod <-glm(cbind(damage, 6- damage) ~ temp, family = binomial, data = orings)lmod %>%tbl_regression(intercept =TRUE)
Characteristic
log(OR)
95% CI
p-value
(Intercept)
12
5.6, 19
<0.001
temp
-0.22
-0.33, -0.12
<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Let’s fit an equivalent (weighted) Bernoulli model. Not surprisingly, we get identical estimate.
obs_wt =c(rbind(orings$damage, 6- orings$damage))(orings_long = orings %>%slice(rep(1:n(), each =2)) %>%# replicate each row twicemutate(damage =rep(c(1, 0), 23)) %>%mutate(obs_wt = obs_wt))
glm(damage ~ temp, weights = obs_wt, family = binomial, data = orings_long) %>%tbl_regression(intercept =TRUE)
Characteristic
log(OR)
95% CI
p-value
(Intercept)
12
5.6, 19
<0.001
temp
-0.22
-0.33, -0.12
<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
5 Prediction
Now we can predict the failure probability. The failure probability at 31F is nearly 1.
tibble(temp =seq(25, 85, 0.1), predprob =predict(lmod, newdata =tibble(temp = temp), type ="response")) %>%ggplot() +geom_line(mapping =aes(x = temp, y = predprob)) +geom_vline(xintercept =31) +labs(x ="Temperature (F)", y ="Predicted failure probability")
6 Goodness of fit
To evaluate the goodness of fit of model, we compare it to the full model with number of parameters equal to the number of observations. That is the full/saturated model estimates \(p_i\) by \(y_i/m_i\). Therefore the deviance is \[\begin{eqnarray*}
D &=& 2 \sum_i y_i \log(y_i/m_i) + (m_i - y_i) \log(1 - y_i / m_i) \\
& & - 2 \sum_i y_i \log(\widehat{p}_i) + (m_i - y_i) \log(1 - \widehat{p}_i) \\
&=& 2 \sum_i y_i \log(y_i / \widehat{y}_i) + (m_i - y_i) \log(m_i - y_i)/(m_i - \widehat{y}_i),
\end{eqnarray*}\] where \(\widehat{y}_i\) are the fitted values from the model.
When \(Y\) is truely binomial and \(m_i\) are relatively large, the deviance \(D\) is approximately \(\chi_{n-q-1}^2\) if the model is correct. A rule of thumb to use this asymptotic approximation is \(m_i \ge 5\). The large p-value indicates that the model has an adequate fit.
This large p-value indicates that this model fits data well.
We can define the Perason residuals as \[
r_i^{\text{P}} = \frac{y_i - n_i \widehat{p}_i}{\sqrt{\operatorname{Var} \widehat{y}_i}}.
\] Then \[
X^2 = \sum_i \left(r_i^{\text{P}} \right)^2
\] in analogy to \(\text{RSS} = \sum_i r_i^2\) in linear regression.
If we detect a lack of fit (unusual large deviance or Pearson \(X^2\)), we consider following causes.
Wrong systematic component. Most common explanation is we have the wrong structural form for the model. We have not included the right predictors or we have not transformed or combined them in the correct way.
Outliers. A few outliers with large deviance. If there are many, then we may have assumed the wrong error distribution for \(Y\).
Sparse data. If binomial batch sizes are all small, then the \(\chi^2\) approximation is bad. Rule of thumb for using the \(\chi^2\) approximation is \(m_i \ge 5\) for all batches.
Overdisperson or underdisperson.
Binomial distribution arises when we assume that the trials in a batch are independent and share the same “success” probability.
Are the failures of those 6 O-rings in a space shuttle really independent of each other?
Are the failure probability of those 6 O-rings in a space shuttle same?
Violation of the iid (identically and independently distributed) assumption of Bernoulli trials can lead to inflation of variance. (Exercise: compare the variance of beta-binomial to that of binomial.)
We can relax the assumption of binomial distribution by allowing an overdispersion parameter\[
\operatorname{Var} Y = \sigma^2 m p (1 - p).
\] The dispersion parameter \(\sigma^2\) can be estimated by \[
\widehat{\sigma}^2 = \frac{X^2}{n - q - 1},
\] where \(q\) is the number of parameters consumed by the logistic regression. Estimation of \(\boldsymbol{\beta}\) is unchanged but the inference is changed \[
\operatorname{Var} \widehat{\boldsymbol{\beta}} = \widehat{\sigma}^2 (\mathbf{X}^T \widehat{\mathbf{W}} \mathbf{X})^{-1}.
\]
Analysis of deviance by differences in deviance cannot be used because now the test statistic is distributed as \(\sigma^2 \chi^2\). Since \(\sigma^2\) is estimated, we can use \(F\) test by comparing \[
F = \frac{(D_{\text{small}} - D_{\text{large}})/ (\text{df}_{\text{small}} - \text{df}_{\text{large}})}{\widehat{\sigma}^2}
\] to the (asymptotic) F distribution with degrees of freedom \(\text{df}_{\text{small}} - \text{df}_{\text{large}}\) and \(n-q\).
The troutegg data set records the number of surviving eggs at 5 stream locations and retrieved at 4 different times.
A deviance value of 64.4951215 on 12 suggests a lack of fit. Diagnostics (not done here) does not reveal outliers. We estimate the dispersion parameter as
(sigma2 <-sum(residuals(bmod, type ="pearson")^2) /12)
[1] 5.330322
which is substantially larger than 1. We can now see the scaled up standard errors
summary(bmod, dispersion = sigma2)
Call:
glm(formula = cbind(survive, total - survive) ~ location + period,
family = "binomial", data = troutegg)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.6358 0.6495 7.138 9.49e-13 ***
location2 -0.4168 0.5682 -0.734 0.4632
location3 -1.2421 0.5066 -2.452 0.0142 *
location4 -0.9509 0.5281 -1.800 0.0718 .
location5 -4.6138 0.5777 -7.987 1.39e-15 ***
period7 -2.1702 0.5504 -3.943 8.05e-05 ***
period8 -2.3256 0.5609 -4.146 3.38e-05 ***
period11 -2.4500 0.5405 -4.533 5.82e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 5.330322)
Null deviance: 1021.469 on 19 degrees of freedom
Residual deviance: 64.495 on 12 degrees of freedom
AIC: 157.03
Number of Fisher Scoring iterations: 5
and make F tests on predictors
drop1(bmod, scale = sigma2, test ="F")
Warning in drop1.glm(bmod, scale = sigma2, test = "F"): F test assumes
'quasibinomial' family
Single term deletions
Model:
cbind(survive, total - survive) ~ location + period
scale: 5.330322
Df Deviance AIC F value Pr(>F)
<none> 64.50 157.03
location 4 913.56 308.32 39.494 8.142e-07 ***
period 3 228.57 181.81 10.176 0.001288 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This dispersion parameter method is only appropriate when the covariate classes are roughly equal in size. If not, more sophisticated methods should be used. One such approach uses the beta-binomial distribution where we assume that \(p\) follows a beta distribution, e.g., dispmod package of Scrucca (2012) in R implements this approach.
10 Quasi-binomial
Another way to deal with overdispersion or underdispsersion is quasi-binomial model.
\(U_i\) acts like the derivative of a log-likelihood (score function). Thus we define \[
Q_i = \int_{y_i}^{\mu_i} \frac{y_i - t}{\phi V(t)} \, dt
\] and treat \[
Q = \sum_{i=1}^n Q_i
\] as a log quasi-likelihood.
\(\widehat{\beta}\) is obtained by maximizing \(Q\) and \[
\widehat{\phi} = \frac{X^2}{n - p}.
\]
Let’s revisit the troutegg example.
modl <-glm(survive / total ~ location + period, family ="quasibinomial", data = troutegg)summary(modl)
Call:
glm(formula = survive/total ~ location + period, family = "quasibinomial",
data = troutegg)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4915 0.6133 7.324 9.17e-06 ***
location2 -0.3677 0.5574 -0.660 0.52191
location3 -1.2027 0.5055 -2.379 0.03481 *
location4 -0.9569 0.5166 -1.852 0.08874 .
location5 -4.4356 0.5529 -8.022 3.66e-06 ***
period7 -2.0660 0.5137 -4.022 0.00169 **
period8 -2.2046 0.5141 -4.288 0.00105 **
period11 -2.3426 0.5143 -4.555 0.00066 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.04670695)
Null deviance: 8.88362 on 19 degrees of freedom
Residual deviance: 0.58247 on 12 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
Individual predictor significance is
drop1(modl, test ="F")
Single term deletions
Model:
survive/total ~ location + period
Df Deviance F value Pr(>F)
<none> 0.5825
location 4 7.9985 38.196 9.788e-07 ***
period 3 2.0545 10.109 0.001325 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see similar p-values as the overdispersion model.
11 Beta regression
We may also directly model the proportions as a beta distribution \(\text{Beta}(\alpha, \beta)\) with density \[
f(y \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} y^{\alpha - 1}(1 - y)^{\beta - 1}.
\] If \(Y \sim \text{Beta}(\alpha, \beta)\), then \[
\mathbb{E}(Y) = \mu = \frac{\alpha}{\alpha + \beta}, \quad \operatorname{Var}(Y) = \frac{\alpha \beta}{(\alpha + \beta)^2(1 + \alpha + \beta)}.
\] Then we can link mean \(\mu\) to the linear predictor \(\eta\) using any link function for binomial model.
library(mgcv)modb <-gam(survive / total ~ location + period,family =betar(), data = troutegg)summary(modb)