Bonus question, Binary Response (ELMR Chapter 2)

Biostat 200C

Author

Dr. Jin Zhou @ UCLA

Published

May 6, 2025

rm(list=ls())
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.1    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.4.1       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.49        jsonlite_1.9.1    xfun_0.51        
[13] digest_0.6.37     rlang_1.1.5       evaluate_1.0.3   
library(tidyverse)
library(faraway)
library(gtsummary)

1 Binomial model

  • We model \(Y_i\) as a Binomial random variable with batch size \(m_i\) and “success” probability \(p_i\) \[ \mathbb{P}(Y_i = y_i) = \binom{m_i}{y_i} p_i^{y_i} (1 - p_i)^{m_i - y_i}. \]

  • The parameter \(p_i\) is linked to the predictors \(X_1, \ldots, X_{q}\) via an inverse link function \[ p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}, \] where \(\eta_i\) is the linear predictor or systematic component \[ \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{q} x_{iq} = \mathbf{x}_i^T \boldsymbol{\beta}. \]

  • The log-likelihood is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left[ y_i \log p_i + (m_i - y_i) \log (1 - p_i) + \log \binom{m_i}{y_i} \right] \\ &=& \sum_{i=1}^n \left[ y_i \eta_i - m_i \log ( 1 + e^{\eta_i}) + \log \binom{m_i}{y_i} \right] \\ &=& \sum_{i=1}^n \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - m_i \log ( 1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) + \log \binom{m_i}{y_i} \right]. \end{eqnarray*}\]

  • The Bernoulli model in ELMR 2 is a special case with all batch sizes \(m_i = 1\).

  • Conversely, a Binomial observation with batch size \(m_i\) contributes the same likelihood as \(m_i\) Bernoulli observations. The Binomial model can therefore be fit as a Bernoulli model with \(\sum_i m_i\) observations, or as a weighted Bernoulli model with two rows per batch carrying observation weights \(y_i\) (successes) and \(m_i - y_i\) (failures).

  • To illustrate this equivalence, we expand the orings data (23 batches of \(m_i = 6\) O-rings each) into the corresponding Bernoulli data set, with one 0/1 row per O-ring.
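Before turning to the orings data, here is a quick numerical check of the binomial log-likelihood formula above, on simulated batches (all names and the coefficient values are illustrative):

```r
set.seed(200)
n <- 10
m <- rep(6, n)                      # batch sizes
x <- runif(n, -1, 1)                # a single predictor
beta <- c(-0.5, 1.2)                # illustrative "true" coefficients
eta <- beta[1] + beta[2] * x        # linear predictor
p <- exp(eta) / (1 + exp(eta))      # inverse logit link
y <- rbinom(n, size = m, prob = p)  # binomial responses

# log-likelihood via the closed form derived above ...
ll_formula <- sum(y * eta - m * log(1 + exp(eta)) + lchoose(m, y))
# ... and directly from the binomial pmf
ll_dbinom <- sum(dbinom(y, size = m, prob = p, log = TRUE))
all.equal(ll_formula, ll_dbinom)
```

The two agree to machine precision, confirming the algebraic simplification \(y_i \log p_i + (m_i - y_i)\log(1 - p_i) = y_i \eta_i - m_i \log(1 + e^{\eta_i})\).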

obs_wt = c(rbind(orings$damage, 6 - orings$damage))
(orings_long <- orings %>%
  slice(rep(1:n(), each = 2)) %>%        # two rows per batch
  mutate(damage = rep(c(1, 0), 23)) %>%  # first row: success (1), second: failure (0)
  mutate(obs_wt = obs_wt) %>%            # counts: damage and 6 - damage
  rowwise() %>%
  slice(rep(1, obs_wt)))                 # replicate each row obs_wt times
# A tibble: 138 × 3
# Rowwise: 
    temp damage obs_wt
   <dbl>  <dbl>  <dbl>
 1    53      1      5
 2    53      1      5
 3    53      1      5
 4    53      1      5
 5    53      1      5
 6    53      0      1
 7    57      1      1
 8    57      0      5
 9    57      0      5
10    57      0      5
# ℹ 128 more rows
glm(damage ~ temp, family = binomial, data = orings_long) %>%
  tbl_regression(intercept = TRUE)
Characteristic   log(OR)   95% CI         p-value
(Intercept)      12        5.6, 19        <0.001
temp             -0.22     -0.33, -0.12   <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
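The same fit can be obtained without replicating rows, by passing the counts to `glm()`'s `weights` argument: this is the weighted-Bernoulli formulation described above. A minimal sketch (the object names `orings_wt`, `fit_wt`, `fit_grp` are ours):

```r
library(faraway)  # for the orings data

# Two rows per batch (damage = 1 then 0), with their counts as prior weights;
# no row replication needed.
orings_wt <- data.frame(
  temp   = rep(orings$temp, each = 2),
  damage = rep(c(1, 0), nrow(orings)),
  obs_wt = c(rbind(orings$damage, 6 - orings$damage))
)
fit_wt <- glm(damage ~ temp, family = binomial, weights = obs_wt,
              data = orings_wt)

# Grouped binomial fit for comparison (two-column response)
fit_grp <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial,
               data = orings)

rbind(weighted = coef(fit_wt), grouped = coef(fit_grp))
```

Both rows of coefficients agree, since the weighted Bernoulli and grouped binomial log-likelihoods differ only by a constant not involving \(\boldsymbol{\beta}\).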
  • We write out the log-likelihood for the reformatted data with binary outcomes. Given \(n = \sum_{i} m_i\) (where \(m_i\) is the batch size) data points \((y_i, \mbox{temp}_i)\), \(i=1,\ldots,n\), the log-likelihood is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_i \log \left[p_i^{y_i} (1 - p_i)^{1 - y_i}\right] \\ &=& \sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\ &=& \sum_i \left[ y_i \log \frac{e^{\eta_i}}{1 + e^{\eta_i}} + (1 - y_i) \log \frac{1}{1 + e^{\eta_i}} \right] \\ &=& \sum_i \left[ y_i \eta_i - \log (1 + e^{\eta_i}) \right] \\ &=& \sum_i \left[ y_i \cdot (\beta_0 + \beta_1 \mbox{temp}_i) - \log (1 + e^{\beta_0 + \beta_1 \mbox{temp}_i}) \right]. \end{eqnarray*}\]

  • Therefore,

\[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_i \left[ y_i \cdot (\beta_0 + \beta_1 \mbox{temp}_i) - \log (1 + e^{\beta_0 + \beta_1 \mbox{temp}_i}) \right]\\ &=& \mbox{obs\_wt}_1 \cdot \left[ 1 \cdot (\beta_0 + \beta_1 \mbox{temp}_1) - \log (1 + e^{\beta_0 + \beta_1 \mbox{temp}_1}) \right] + \\ && \mbox{obs\_wt}_2 \cdot \left[ 0 \cdot (\beta_0 + \beta_1 \mbox{temp}_1) - \log (1 + e^{\beta_0 + \beta_1 \mbox{temp}_1}) \right] + \\ && \cdots + \\ && \mbox{obs\_wt}_{45} \cdot \left[ 1 \cdot (\beta_0 + \beta_1 \mbox{temp}_{23}) - \log (1 + e^{\beta_0 + \beta_1 \mbox{temp}_{23}}) \right] + \\ && \mbox{obs\_wt}_{46} \cdot \left[ 0 \cdot (\beta_0 + \beta_1 \mbox{temp}_{23}) - \log (1 + e^{\beta_0 + \beta_1 \mbox{temp}_{23}}) \right], \end{eqnarray*}\] where \(\mbox{temp}_j\) on the right-hand side denotes the temperature of the \(j\)-th batch: the 46 weighted terms pair each of the 23 temperatures with the outcomes 1 and 0.
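As a sanity check, we can evaluate this weighted log-likelihood at the fitted coefficients and compare it with `logLik()` from the weighted glm fit. A sketch (the names `ow`, `fit`, `ll` are ours):

```r
library(faraway)  # for the orings data

# Weighted-Bernoulli data as in the text: 46 rows, weights y_i and m_i - y_i
ow <- data.frame(
  temp   = rep(orings$temp, each = 2),
  damage = rep(c(1, 0), nrow(orings)),
  obs_wt = c(rbind(orings$damage, 6 - orings$damage))
)
fit <- glm(damage ~ temp, family = binomial, weights = obs_wt, data = ow)

# Evaluate the displayed log-likelihood at the MLE
eta <- coef(fit)[1] + coef(fit)[2] * ow$temp
ll  <- sum(ow$obs_wt * (ow$damage * eta - log(1 + exp(eta))))

c(by_hand = ll, from_glm = as.numeric(logLik(fit)))
```

The two match exactly: because each weighted row has \(y_i \in \{0, 1\}\), every binomial coefficient \(\binom{w_i}{w_i y_i}\) equals 1 and contributes nothing on the log scale.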