Binomial and Proportion Responses (ELMR Chapter 3)

Biostat 200C

Author

Dr. Jin Zhou @ UCLA

Published

April 14, 2026

Display system information and load the tidyverse, faraway, and gtsummary packages

rm(list=ls())
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.50        jsonlite_2.0.0    xfun_0.56        
[13] digest_0.6.37     rlang_1.1.7       evaluate_1.0.5   
library(tidyverse)
library(faraway)
library(gtsummary)

The faraway package contains the datasets used in the ELMR book.

1 O-ring example

  • In January 1986, the space shuttle Challenger exploded 73 seconds after launch.

  • The culprit was the O-ring seals in the rocket boosters: at lower temperatures, rubber becomes more brittle and seals less effectively. At the time of the launch, the temperature was 31°F.

  • Could the failure of the O-rings have been predicted?

  • Data: 23 previous shuttle missions. Each shuttle had 2 boosters, each with 3 O-rings. We know the number of O-rings out of 6 showing some damage and the launch temperature.

orings <- orings %>%
  as_tibble(rownames = "mission") %>%
  print(n = Inf)
# A tibble: 23 × 3
   mission  temp damage
   <chr>   <dbl>  <dbl>
 1 1          53      5
 2 2          57      1
 3 3          58      1
 4 4          63      1
 5 5          66      0
 6 6          67      0
 7 7          67      0
 8 8          67      0
 9 9          68      0
10 10         69      0
11 11         70      1
12 12         70      0
13 13         70      1
14 14         70      0
15 15         72      0
16 16         73      0
17 17         75      0
18 18         75      1
19 19         76      0
20 20         76      0
21 21         78      0
22 22         79      0
23 23         81      0

2 Descriptive statistics

There seems to be a pattern: the lower the temperature, the more damage.

oring_plot <- orings %>%
  mutate(failrate = damage / 6) %>%
  ggplot(mapping = aes(x = temp, y = failrate)) + 
  geom_point() + 
  labs(x = "Temperature (F)", y = "Proportion of damages", title = "O-ring damage data")
plot(oring_plot)

3 Binomial model

  • We model \(Y_i\) as a Binomial random variable with batch size \(m_i\) and “success” probability \(p_i\) \[ \mathbb{P}(Y_i = y_i) = \binom{m_i}{y_i} p_i^{y_i} (1 - p_i)^{m_i - y_i}. \]

  • The parameter \(p_i\) is linked to the predictors \(X_1, \ldots, X_{q}\) via an inverse link function \[ p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}, \] where \(\eta_i\) is the linear predictor or systematic component \[ \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{q} x_{iq} = \mathbf{x}_i^T \boldsymbol{\beta} \]

  • The log-likelihood is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left[ y_i \log p_i + (m_i - y_i) \log (1 - p_i) + \log \binom{m_i}{y_i} \right] \\ &=& \sum_{i=1}^n \left[ y_i \eta_i - m_i \log ( 1 + e^{\eta_i}) + \log \binom{m_i}{y_i} \right] \\ &=& \sum_{i=1}^n \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - m_i \log ( 1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) + \log \binom{m_i}{y_i} \right]. \end{eqnarray*}\]

  • The Bernoulli model in ELMR 2 is a special case with all batch sizes \(m_i = 1\).

  • Conversely, a Binomial model is equivalent to a Bernoulli model with \(\sum_i m_i\) observations, or to a Bernoulli model on \(2n\) rows with observation weights \((y_i, m_i - y_i)\).
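As a sanity check on the log-likelihood above, we can evaluate it directly at the fitted coefficients and compare with `logLik()`. A minimal sketch, with the O-ring data values transcribed from the table above so the snippet is self-contained:

```r
# Evaluate the binomial log-likelihood at the MLE and compare with logLik();
# data values transcribed from the orings table printed earlier
temp   <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
            70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
damage <- c(5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
            1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
fit <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial)
eta <- coef(fit)[1] + coef(fit)[2] * temp          # linear predictor
ll  <- sum(damage * eta - 6 * log(1 + exp(eta)) + lchoose(6, damage))
c(manual = ll, logLik = as.numeric(logLik(fit)))   # the two agree
```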

4 Logistic regression

For a Binomial model, we supply \((\text{successes}_i, \text{failures}_i)\) as the response.

lmod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = orings)
lmod %>%
  tbl_regression(intercept = TRUE)
Characteristic log(OR) 95% CI p-value
(Intercept) 12 5.6, 19 <0.001
temp -0.22 -0.33, -0.12 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
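Since the table reports coefficients on the log odds-ratio scale, exponentiating gives the multiplicative effect on the odds. A small sketch, refitting the model from the data values printed above so it stands alone:

```r
# Odds-ratio interpretation: exp(beta_temp) multiplies the damage odds
# per 1 degree F increase; data transcribed from the orings table above
temp   <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
            70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
damage <- c(5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
            1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
fit <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial)
exp(coef(fit))["temp"]   # about 0.81: damage odds drop roughly 19% per degree
```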

Let’s fit an equivalent (weighted) Bernoulli model. Not surprisingly, we get identical estimates.

obs_wt = c(rbind(orings$damage, 6 - orings$damage))
(orings_long = orings %>%
  slice(rep(1:n(), each = 2)) %>% # replicate each row twice
  mutate(damage = rep(c(1, 0), 23)) %>%
  mutate(obs_wt = obs_wt)) 
# A tibble: 46 × 4
   mission  temp damage obs_wt
   <chr>   <dbl>  <dbl>  <dbl>
 1 1          53      1      5
 2 1          53      0      1
 3 2          57      1      1
 4 2          57      0      5
 5 3          58      1      1
 6 3          58      0      5
 7 4          63      1      1
 8 4          63      0      5
 9 5          66      1      0
10 5          66      0      6
# ℹ 36 more rows
glm(damage ~ temp, weights = obs_wt, family = binomial, data = orings_long) %>%
  tbl_regression(intercept = TRUE)
Characteristic log(OR) 95% CI p-value
(Intercept) 12 5.6, 19 <0.001
temp -0.22 -0.33, -0.12 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

5 Prediction

Now we can predict the failure probability at any temperature. The predicted failure probability at 31°F is nearly 1.

tibble(temp     = seq(25, 85, 0.1), 
       predprob = predict(lmod, newdata = tibble(temp = temp), type = "response")) %>%
  ggplot() + 
  geom_line(mapping = aes(x = temp, y = predprob)) + 
  geom_vline(xintercept = 31) +
  labs(x = "Temperature (F)", y = "Predicted failure probability")
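The prediction at 31°F can also be computed by hand from the coefficients via the inverse logit (`plogis`). A self-contained sketch using the data values printed earlier:

```r
# Manual prediction at 31 F: apply the inverse logit to the linear predictor;
# data transcribed from the orings table above
temp   <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
            70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
damage <- c(5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
            1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
fit <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial)
p31 <- plogis(coef(fit)[1] + coef(fit)[2] * 31)    # inverse logit at 31 F
unname(p31)                                        # close to 1
# same value as predict() with type = "response"
predict(fit, newdata = data.frame(temp = 31), type = "response")
```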

6 Goodness of fit

  • To evaluate the goodness of fit of a model, we compare it to the full (saturated) model, which has as many parameters as observations and estimates \(p_i\) by \(y_i/m_i\). Therefore the deviance is \[\begin{eqnarray*} D &=& 2 \sum_i \left[ y_i \log(y_i/m_i) + (m_i - y_i) \log(1 - y_i / m_i) \right] \\ & & - 2 \sum_i \left[ y_i \log(\widehat{p}_i) + (m_i - y_i) \log(1 - \widehat{p}_i) \right] \\ &=& 2 \sum_i \left[ y_i \log(y_i / \widehat{y}_i) + (m_i - y_i) \log \frac{m_i - y_i}{m_i - \widehat{y}_i} \right], \end{eqnarray*}\] where \(\widehat{y}_i = m_i \widehat{p}_i\) are the fitted values from the model.

  • When \(Y\) is truly binomial and the \(m_i\) are relatively large, the deviance \(D\) is approximately \(\chi_{n-q-1}^2\) if the model is correct. A rule of thumb for using this asymptotic approximation is \(m_i \ge 5\). A large p-value indicates that the model has an adequate fit.

pchisq(lmod$deviance, lmod$df.residual, lower = FALSE)
[1] 0.7164099
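The deviance reported by `glm` can be reproduced from the formula above, using the convention \(0 \log 0 = 0\). A self-contained sketch with the data values printed earlier:

```r
# Reproduce the residual deviance from its definition;
# data transcribed from the orings table above
temp   <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
            70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
damage <- c(5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
            1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
fit  <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial)
phat <- fitted(fit)               # fitted damage probabilities
yhat <- 6 * phat                  # fitted damage counts
olog <- function(o, e) ifelse(o == 0, 0, o * log(o / e))  # 0 * log(0) := 0
D <- 2 * sum(olog(damage, yhat) + olog(6 - damage, 6 - yhat))
c(manual = D, glm = deviance(fit))   # both about 16.9
```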
  • Significance of the predictor temp can be evaluated using a similar analysis of deviance test.
drop1(lmod, test = "Chi")
Single term deletions

Model:
cbind(damage, 6 - damage) ~ temp
       Df Deviance    AIC    LRT  Pr(>Chi)    
<none>      16.912 33.675                     
temp    1   38.898 53.660 21.985 2.747e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7 Pearson \(X^2\)

  • The Pearson \(X^2\) statistic takes the form \[ X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}. \] For the binomial model, this gives \[\begin{eqnarray*} X^2 &=& \sum_i \left\{ \frac{(y_i - m_i \widehat{p}_i)^2}{m_i \widehat{p}_i} + \frac{[(m_i - y_i) - m_i (1 - \widehat{p}_i)]^2}{m_i (1 - \widehat{p}_i)} \right\} \\ &=& \sum_i \frac{(y_i - m_i \widehat{p}_i)^2}{m_i \widehat{p}_i (1 - \widehat{p}_i)}. \end{eqnarray*}\] Comparing the Pearson \(X^2\) statistic to its asymptotic null distribution \(\chi_{n - q - 1}^2\) gives a goodness of fit test.
predprob <- predict(lmod, type = "response")
# Pearson X2 statistic
(px2 <- sum((orings$damage - 6 * predprob)^2 / (6 * predprob * (1 - predprob))))
[1] 28.06738
pchisq(px2, lmod$df.residual, lower.tail = FALSE)
[1] 0.1382507

This large p-value again indicates no evidence of lack of fit.

  • We can define the Pearson residuals as \[ r_i^{\text{P}} = \frac{y_i - m_i \widehat{p}_i}{\sqrt{\operatorname{Var} \widehat{y}_i}}. \] Then \[ X^2 = \sum_i \left(r_i^{\text{P}} \right)^2, \] in analogy to \(\text{RSS} = \sum_i r_i^2\) in linear regression.
(presid <- residuals(lmod, type = "pearson"))
         1          2          3          4          5          6          7 
 1.3928147 -0.8972668 -0.6821441  0.3214099 -0.6647553 -0.5966330 -0.5966330 
         8          9         10         11         12         13         14 
-0.5966330 -0.5354916 -0.4806159  1.9587595 -0.4313637  1.9587595 -0.4313637 
        15         16         17         18         19         20         21 
-0.3474838 -0.3118746 -0.2512296  3.7710639 -0.2254843 -0.2254843 -0.1816382 
        22         23 
-0.1630244 -0.1313239 
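As a check of the identity \(X^2 = \sum_i (r_i^{\text{P}})^2\), a self-contained sketch using the data values printed earlier:

```r
# Pearson X2 computed directly equals the sum of squared Pearson residuals;
# data transcribed from the orings table above
temp   <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
            70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
damage <- c(5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
            1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
fit  <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial)
phat <- fitted(fit)
x2_direct <- sum((damage - 6 * phat)^2 / (6 * phat * (1 - phat)))
x2_resid  <- sum(residuals(fit, type = "pearson")^2)
c(direct = x2_direct, residuals = x2_resid)   # both about 28.07
```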

8 Diagnostics

  • Plot deviance residuals against the linear predictors to identify observations with unusually large residuals.
orings %>%
  mutate(devres  = residuals(lmod, type = "deviance"),
         linpred = predict(lmod, type = "link")) %>%
  ggplot + 
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Deviance residual")

  • Plot sorted hat values against half-normal quantiles to identify potential high-leverage observations.
halfnorm(hatvalues(lmod))

orings %>%
  slice(c(1, 2))
# A tibble: 2 × 3
  mission  temp damage
  <chr>   <dbl>  <dbl>
1 1          53      5
2 2          57      1
  • Plot sorted Cook distances against half-normal quantiles to identify potentially influential observations.
halfnorm(cooks.distance(lmod))

orings %>%
  slice(c(1, 18))
# A tibble: 2 × 3
  mission  temp damage
  <chr>   <dbl>  <dbl>
1 1          53      5
2 18         75      1

9 Overdispersion

  • If we detect a lack of fit (unusually large deviance or Pearson \(X^2\)), we consider the following causes.

    1. Wrong systematic component. The most common explanation is that we have the wrong structural form for the model: we have not included the right predictors, or we have not transformed or combined them in the correct way.

    2. Outliers. A few outliers can inflate the deviance. If there are many, then we may have assumed the wrong error distribution for \(Y\).

    3. Sparse data. If binomial batch sizes are all small, then the \(\chi^2\) approximation is bad. Rule of thumb for using the \(\chi^2\) approximation is \(m_i \ge 5\) for all batches.

    4. Overdispersion or underdispersion.

  • Binomial distribution arises when we assume that the trials in a batch are independent and share the same “success” probability.

    • Are the failures of those 6 O-rings in a space shuttle really independent of each other?

    • Are the failure probabilities of those 6 O-rings in a space shuttle the same?

  • Violation of the iid (independently and identically distributed) assumption of Bernoulli trials can lead to inflation of the variance. (Exercise: compare the variance of the beta-binomial to that of the binomial.)
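A quick simulation illustrating the exercise (a sketch with illustrative parameters, not from ELMR): letting the success probability vary between batches according to a beta distribution inflates the variance relative to a binomial with the same mean.

```r
# Beta-binomial vs binomial variance (illustrative parameters)
set.seed(200)
m <- 6                               # batch size, as in the O-ring data
p <- rbeta(1e5, 2, 4)                # per-batch probabilities, mean 1/3
y <- rbinom(1e5, m, p)               # beta-binomial counts
c(betabin_var = var(y),
  binom_var   = m * (1/3) * (2/3))   # binomial with the same mean
```

The simulated variance is well above the binomial value, demonstrating the overdispersion induced by a shared-but-random \(p\).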

  • We can relax the binomial assumption by allowing an overdispersion parameter \[ \operatorname{Var} Y = \sigma^2 m p (1 - p). \] The dispersion parameter \(\sigma^2\) can be estimated by \[ \widehat{\sigma}^2 = \frac{X^2}{n - q - 1}, \] where \(q\) is the number of predictors in the logistic regression, so \(n - q - 1\) is the residual degrees of freedom. Estimation of \(\boldsymbol{\beta}\) is unchanged, but the inference changes: \[ \operatorname{Var} \widehat{\boldsymbol{\beta}} = \widehat{\sigma}^2 (\mathbf{X}^T \widehat{\mathbf{W}} \mathbf{X})^{-1}. \]

  • Analysis of deviance by differences in deviance cannot be used because now the test statistic is distributed as \(\sigma^2 \chi^2\). Since \(\sigma^2\) is estimated, we can use \(F\) test by comparing \[ F = \frac{(D_{\text{small}} - D_{\text{large}})/ (\text{df}_{\text{small}} - \text{df}_{\text{large}})}{\widehat{\sigma}^2} \] to the (asymptotic) F distribution with degrees of freedom \(\text{df}_{\text{small}} - \text{df}_{\text{large}}\) and \(n-q\).

  • The troutegg data set records the number of surviving eggs at 5 stream locations and retrieved at 4 different times.

troutegg
   survive total location period
1       89    94        1      4
2      106   108        2      4
3      119   123        3      4
4      104   104        4      4
5       49    93        5      4
6       94    98        1      7
7       91   106        2      7
8      100   130        3      7
9       80    97        4      7
10      11   113        5      7
11      77    86        1      8
12      87    96        2      8
13      88   119        3      8
14      67    99        4      8
15      18    88        5      8
16     141   155        1     11
17     104   122        2     11
18      91   125        3     11
19     111   132        4     11
20       0   138        5     11

We fit a binomial model for the two main effects.

library(gtsummary)
bmod <- glm(cbind(survive, total - survive) ~ location + period,
            family = "binomial", data = troutegg)
#bmod %>%
#  tbl_regression(exponentiate = TRUE) %>%
#  bold_labels()
summary(bmod)

Call:
glm(formula = cbind(survive, total - survive) ~ location + period, 
    family = "binomial", data = troutegg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
location2    -0.4168     0.2461  -1.694   0.0903 .  
location3    -1.2421     0.2194  -5.660 1.51e-08 ***
location4    -0.9509     0.2288  -4.157 3.23e-05 ***
location5    -4.6138     0.2502 -18.439  < 2e-16 ***
period7      -2.1702     0.2384  -9.103  < 2e-16 ***
period8      -2.3256     0.2429  -9.573  < 2e-16 ***
period11     -2.4500     0.2341 -10.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1021.469  on 19  degrees of freedom
Residual deviance:   64.495  on 12  degrees of freedom
AIC: 157.03

Number of Fisher Scoring iterations: 5

Analysis of deviance for the significance of two factors

drop1(bmod, test = "Chisq")
Single term deletions

Model:
cbind(survive, total - survive) ~ location + period
         Df Deviance    AIC    LRT  Pr(>Chi)    
<none>         64.50 157.03                     
location  4   913.56 998.09 849.06 < 2.2e-16 ***
period    3   228.57 315.11 164.08 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • A deviance of 64.5 on 12 degrees of freedom suggests a lack of fit. Diagnostics (not shown here) do not reveal outliers. We estimate the dispersion parameter as
(sigma2 <- sum(residuals(bmod, type = "pearson")^2) / 12)
[1] 5.330322

which is substantially larger than 1. We can now see the scaled-up standard errors

summary(bmod, dispersion = sigma2)

Call:
glm(formula = cbind(survive, total - survive) ~ location + period, 
    family = "binomial", data = troutegg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   4.6358     0.6495   7.138 9.49e-13 ***
location2    -0.4168     0.5682  -0.734   0.4632    
location3    -1.2421     0.5066  -2.452   0.0142 *  
location4    -0.9509     0.5281  -1.800   0.0718 .  
location5    -4.6138     0.5777  -7.987 1.39e-15 ***
period7      -2.1702     0.5504  -3.943 8.05e-05 ***
period8      -2.3256     0.5609  -4.146 3.38e-05 ***
period11     -2.4500     0.5405  -4.533 5.82e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 5.330322)

    Null deviance: 1021.469  on 19  degrees of freedom
Residual deviance:   64.495  on 12  degrees of freedom
AIC: 157.03

Number of Fisher Scoring iterations: 5

and make F tests on predictors

drop1(bmod, scale = sigma2, test = "F")
Warning in drop1.glm(bmod, scale = sigma2, test = "F"): F test assumes
'quasibinomial' family
Single term deletions

Model:
cbind(survive, total - survive) ~ location + period

scale:  5.330322 

         Df Deviance    AIC F value    Pr(>F)    
<none>         64.50 157.03                      
location  4   913.56 308.32  39.494 8.142e-07 ***
period    3   228.57 181.81  10.176  0.001288 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
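The dispersion correction simply rescales every standard error by \(\sqrt{\widehat{\sigma}^2}\). A self-contained check, with the troutegg values transcribed from the table above:

```r
# Verify that the dispersion-adjusted SEs equal sqrt(sigma2) times the
# binomial SEs; troutegg data transcribed from the table printed above
survive  <- c(89, 106, 119, 104, 49,   94, 91, 100, 80, 11,
              77, 87, 88, 67, 18,   141, 104, 91, 111, 0)
total    <- c(94, 108, 123, 104, 93,   98, 106, 130, 97, 113,
              86, 96, 119, 99, 88,   155, 122, 125, 132, 138)
location <- factor(rep(1:5, times = 4))
period   <- factor(rep(c(4, 7, 8, 11), each = 5))
bmod   <- glm(cbind(survive, total - survive) ~ location + period,
              family = binomial)
sigma2 <- sum(residuals(bmod, type = "pearson")^2) / df.residual(bmod)
se_raw    <- summary(bmod)$coefficients[, "Std. Error"]
se_scaled <- summary(bmod, dispersion = sigma2)$coefficients[, "Std. Error"]
all.equal(se_scaled, sqrt(sigma2) * se_raw)   # TRUE
```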
  • This dispersion-parameter method is only appropriate when the covariate classes are roughly equal in size; otherwise, more sophisticated methods should be used. One such approach uses the beta-binomial distribution, where we assume that \(p\) itself follows a beta distribution; the dispmod package (Scrucca, 2012) implements this approach in R.

10 Quasi-binomial

  • Another way to deal with overdispersion or underdispersion is the quasi-binomial model.

  • Assume \[\begin{eqnarray*} \mathbb{E} (Y_i) = \mu_i, \quad \operatorname{Var}(Y_i) = \phi V(\mu_i). \end{eqnarray*}\] Define a score \[ U_i = \frac{Y_i - \mu_i}{\phi V(\mu_i)}, \] with \[\begin{eqnarray*} \mathbb{E}(U_i) &=& 0, \quad \operatorname{Var}(U_i) = \frac{1}{\phi V(\mu_i)} \\ - \mathbb{E} \frac{\partial U_i}{\partial \mu_i} &=& - \mathbb{E} \frac{- \phi V(\mu_i) - (Y_i - \mu_i) \phi V'(\mu_i)}{[\phi V(\mu_i)]^2} = \frac{1}{\phi V(\mu_i)}. \end{eqnarray*}\] Exercise: Verify that the log-likelihood \(\ell_i\) of a binomial proportion \(Y_i\), where \(m_i Y_i \sim \text{Bin}(m_i, p_i)\), satisfies
    \[\begin{eqnarray*} \mathbb{E} \frac{\partial \ell_i}{\partial \mu_i} &=& 0 \\ \operatorname{Var} \frac{\partial \ell_i}{\partial \mu_i} &=& \frac{1}{\phi V(\mu_i)} \\ \mathbb{E} \frac{\partial^2 \ell_i}{\partial \mu_i^2} &=& - \frac{1}{\phi V(\mu_i)}, \end{eqnarray*}\] with \(\phi = 1\), \(\mu_i = p_i\), and \(V(\mu_i)=p_i (1 - p_i)/m_i\).

  • \(U_i\) acts like the derivative of a log-likelihood (score function). Thus we define \[ Q_i = \int_{y_i}^{\mu_i} \frac{y_i - t}{\phi V(t)} \, dt \] and treat \[ Q = \sum_{i=1}^n Q_i \] as a log quasi-likelihood.
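For the binomial proportion, with \(\phi = 1\) and \(V(t) = t(1-t)/m_i\), the integral can be evaluated in closed form (a short derivation filling in the step): \[ Q_i = \int_{y_i}^{\mu_i} \frac{m_i (y_i - t)}{t(1 - t)} \, dt = m_i \left[ y_i \log \frac{\mu_i}{y_i} + (1 - y_i) \log \frac{1 - \mu_i}{1 - y_i} \right], \] which is the binomial log-likelihood in \(\mu_i\) up to an additive constant, so maximizing \(Q\) reproduces the logistic regression estimates.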

  • \(\widehat{\boldsymbol{\beta}}\) is obtained by maximizing \(Q\), and the dispersion is estimated by \[ \widehat{\phi} = \frac{X^2}{n - p}. \]

  • Let’s revisit the troutegg example.

modl <- glm(survive / total ~ location + period, 
            family = "quasibinomial", data = troutegg)
summary(modl)

Call:
glm(formula = survive/total ~ location + period, family = "quasibinomial", 
    data = troutegg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.4915     0.6133   7.324 9.17e-06 ***
location2    -0.3677     0.5574  -0.660  0.52191    
location3    -1.2027     0.5055  -2.379  0.03481 *  
location4    -0.9569     0.5166  -1.852  0.08874 .  
location5    -4.4356     0.5529  -8.022 3.66e-06 ***
period7      -2.0660     0.5137  -4.022  0.00169 ** 
period8      -2.2046     0.5141  -4.288  0.00105 ** 
period11     -2.3426     0.5143  -4.555  0.00066 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.04670695)

    Null deviance: 8.88362  on 19  degrees of freedom
Residual deviance: 0.58247  on 12  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

Individual predictor significance is assessed with an F test:

drop1(modl, test = "F")
Single term deletions

Model:
survive/total ~ location + period
         Df Deviance F value    Pr(>F)    
<none>        0.5825                      
location  4   7.9985  38.196 9.788e-07 ***
period    3   2.0545  10.109  0.001325 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see p-values similar to those from the overdispersion model.
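One caveat: the quasi-binomial fit above passes the proportion `survive / total` without `weights = total`, so every batch counts equally, which is why the point estimates differ slightly from the binomial fit (intercept 4.49 vs 4.64). Supplying the batch sizes as prior weights recovers the binomial point estimates while keeping quasi-binomial standard errors. A self-contained sketch with the troutegg values transcribed from the table above:

```r
# Quasi-binomial fit with batch sizes as prior weights;
# troutegg data transcribed from the table printed earlier
survive  <- c(89, 106, 119, 104, 49,   94, 91, 100, 80, 11,
              77, 87, 88, 67, 18,   141, 104, 91, 111, 0)
total    <- c(94, 108, 123, 104, 93,   98, 106, 130, 97, 113,
              86, 96, 119, 99, 88,   155, 122, 125, 132, 138)
location <- factor(rep(1:5, times = 4))
period   <- factor(rep(c(4, 7, 8, 11), each = 5))
modl_w <- glm(survive / total ~ location + period, weights = total,
              family = quasibinomial)
coef(modl_w)[1:2]   # matches the binomial fit: 4.6358, -0.4168
```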

11 Beta regression

  • We may also directly model the proportions using a beta distribution \(\text{Beta}(\alpha, \beta)\) with density \[ f(y \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} y^{\alpha - 1}(1 - y)^{\beta - 1}. \] If \(Y \sim \text{Beta}(\alpha, \beta)\), then \[ \mathbb{E}(Y) = \mu = \frac{\alpha}{\alpha + \beta}, \quad \operatorname{Var}(Y) = \frac{\alpha \beta}{(\alpha + \beta)^2(1 + \alpha + \beta)}. \] Then we can link the mean \(\mu\) to the linear predictor \(\eta\) using any link function for the binomial model.
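The beta mean and variance formulas can be sanity-checked by simulation (illustrative shape parameters, not from the text):

```r
# Compare simulated moments of a Beta(2, 5) with the closed-form expressions
set.seed(200)
a <- 2; b <- 5                       # illustrative shape parameters
y <- rbeta(1e6, a, b)
c(sim_mean = mean(y), theory_mean = a / (a + b))
c(sim_var  = var(y),
  theory_var = a * b / ((a + b)^2 * (a + b + 1)))
```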
library(mgcv)
modb <- gam(survive / total ~ location + period,
            family = betar(), data = troutegg)
summary(modb)

Family: Beta regression(7.992) 
Link function: logit 

Formula:
survive/total ~ location + period

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   4.8507     0.5757   8.426  < 2e-16 ***
location2    -0.3353     0.5961  -0.563   0.5738    
location3    -0.9843     0.5739  -1.715   0.0863 .  
location4    -0.3002     0.5975  -0.502   0.6153    
location5    -5.3264     0.6241  -8.534  < 2e-16 ***
period7      -2.7315     0.5614  -4.865 1.14e-06 ***
period8      -2.9455     0.5551  -5.306 1.12e-07 ***
period11     -3.4724     0.5410  -6.419 1.37e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.891   Deviance explained =  110%
-REML = -67.01  Scale est. = 1         n = 20