Review of Analysis of Deviance

Biostat 200C

Author

Dr. Jin Zhou @ UCLA

Published

April 8, 2025

Display system information and load tidyverse and faraway packages

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.1    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.4.1       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.49        jsonlite_1.9.1    xfun_0.51        
[13] digest_0.6.37     rlang_1.1.5       evaluate_1.0.3   
library(tidyverse)
library(faraway)

1 Log-Likelihood Ratio

  • A saturated model

    • If there are \(n\) observations \(Y_i\), \(i=1,\ldots, n\), all with potentially different values of the linear component \(\mathbf{x}_i'\boldsymbol{\beta}\), then we can specify a model with \(n\) parameters so that each predicted value equals the observed \(y_i\).
    • If some of the observations correspond to the same linear combination of the predictors, the number of parameters in the saturated model can be less than \(n\).
    • We use \(m\) to denote the number of parameters that can be estimated in a saturated model.
    • Let \(\boldsymbol{\beta}_{max}\) be the parameter vector of the saturated model.
    • Let \(\mathbf{b}_{max}\) be the MLE from the saturated model and \(L(\mathbf{b}_{max})\) the likelihood evaluated at \(\mathbf{b}_{max}\); it is at least as large as the maximized likelihood of any other model for the same data.
  • Let \(L(\mathbf{b})\) be the maximum value of the likelihood function for the model of interest.

  • Then the likelihood ratio \[ \lambda = \frac{L(\mathbf{b}_{max})}{L(\mathbf{b})} \] provides a way of assessing the goodness of fit of the model.

  • In terms of log-likelihood \[ \log \lambda = \ell(\mathbf{b}_{max})- \ell(\mathbf{b}) \]

2 Deviance

  • The deviance, also called the likelihood ratio statistic, is

\[ D = 2 \log \frac{L (\mathbf{b}_{max})}{L(\mathbf{b})} = 2[\ell(\mathbf{b}_{max})- \ell(\mathbf{b})] \]

  • Using a Taylor expansion, one can show that the sampling distribution of the deviance is approximately \[ D\sim \chi^2(m-p, \nu) \] where \(p\) is the number of parameters in the model of interest and \(\nu\) is a non-centrality parameter that is near zero when the model of interest fits (nearly) as well as the saturated model.
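As a quick numerical illustration (a minimal sketch on simulated grouped binomial data, not one of the course datasets), we can compute \(\ell(\mathbf{b}_{max})\) directly from the saturated fit \(\widehat{p}_i = y_i/m_i\), read off \(\ell(\mathbf{b})\) with logLik(), and check that \(2[\ell(\mathbf{b}_{max}) - \ell(\mathbf{b})]\) matches deviance().

# Minimal sketch: deviance as twice the log-likelihood ratio,
# using simulated grouped binomial data (assumed example, not course data)
set.seed(200)
n <- 20
m <- rep(10, n)                              # group sizes m_i
x <- seq(-2, 2, length.out = n)
y <- rbinom(n, size = m, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(cbind(y, m - y) ~ x, family = binomial)

ll_max <- sum(dbinom(y, size = m, prob = y / m, log = TRUE))  # saturated model
ll_fit <- as.numeric(logLik(fit))                             # model of interest
2 * (ll_max - ll_fit)                        # D = 2 * log(lambda)
deviance(fit)                                # agrees with the line above
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)   # chi-square GOF p-value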

3 Example: Deviance for a Normal linear model

In linear models with the normality assumption, for \(i=1,\ldots, n\), \[ \mbox{E}(Y_i) = \mu_i = \mathbf{x}_i'\boldsymbol{\beta}; \quad Y_i \sim N(\mu_i, \sigma^2), \] where the \(Y_i\) are independent. The log-likelihood function is \[ \ell(\boldsymbol{\beta}) = -\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu_i)^2-\frac{n}{2}\log(2\pi\sigma^2) \]

  • For a saturated model, we use \(y_i\) as the predicted value. Therefore \[ \ell(\mathbf{b}_{max}) = -\frac{n}{2}\log(2\pi \sigma^2) \]

  • For a proposed model with \(p<n\) parameters, the maximum likelihood estimator is \[ \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \]

    • The corresponding maximum value of log-likelihood is \[ \ell(\mathbf{b}) = -\frac{n}{2}\log(2\pi\sigma^2) -\sum_{i=1}^n \frac{1}{2\sigma^2}(y_i -\mathbf{X}_i' \mathbf{b})^2. \]
  • Therefore the deviance is \[ D = 2(\ell(\mathbf{b}_{max}) - \ell(\mathbf{b})) = \sum_{i=1}^n \frac{1}{\sigma^2}(y_i -\mathbf{X}_i' \mathbf{b})^2 = \frac{1}{\sigma^2} (\mathbf{y}-\mathbf{X}\mathbf{b})'(\mathbf{y}-\mathbf{X}\mathbf{b}) \] And we know \[ \mathbf{y}-\mathbf{X}\mathbf{b} = [\mathbf{I}-\mathbf{H}]\mathbf{y} \] where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) and \[ (\mathbf{y}-\mathbf{X}\mathbf{b})'(\mathbf{y}-\mathbf{X}\mathbf{b}) = \mathbf{y}'(\mathbf{I}-\mathbf{H})\mathbf{y} \] Since the rank of \(\mathbf{I}-\mathbf{H}\) is \(n-p\), \(D\) has a chi-squared distribution with \(n-p\) degrees of freedom and non-centrality parameter \((\mathbf{X}\boldsymbol{\beta})'(\mathbf{I}-\mathbf{H})(\mathbf{X}\boldsymbol{\beta})/\sigma^2\). But \((\mathbf{I}-\mathbf{H})\mathbf{X} = \mathbf{0}\), so \(D\) has the central distribution \(\chi^2(n-p)\) exactly. (These identities are checked numerically in the sketch after this list.)

  • The term scaled deviance is sometimes used for \[ \sigma^2D = \sum_{i=1}^n(y_i-\widehat{\mu}_i)^2 \] where \(\widehat{\mu}_i= \mathbf{X}_i' \mathbf{b}\). So if the model fits the data well, then \(D\sim \chi^2(n-p)\). The expected value of a \(\chi^2(n-p)\) random variable is \(n-p\), so the expected value of \(D\) is \(n-p\).

  • This provides an estimate of \(\sigma^2\) as \[ \widehat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i-\widehat{\mu}_i)^2}{n-p} \] or, as we said in our lecture notes: “An unbiased estimate of the error variance \(\sigma^2\) is \(\widehat{\sigma}^2 = \text{RSS}/\text{df}\),” i.e., \[ \widehat{\sigma} = \sqrt{\frac{\text{RSS}}{\text{df}}} \] Some programs output the scaled deviance for a Normal linear model and call \(\widehat{\sigma}^2\) the scale parameter.

  • The deviance is also related to the sum of squares of the standardized residuals \[ \sum_{i=1}^n r_i^2 = \frac{1}{\widehat{\sigma}^2}\sum_{i=1}^n(y_i-\widehat{\mu}_i)^2 \] where \(\widehat{\sigma}^2\) is an estimate of \(\sigma^2\). This provides a rough rule of thumb for the overall magnitude of the standardized residuals: if the model fits well so that \(D\sim \chi^2(n-p)\), you would expect \(\sum_{i=1}^n r_i^2 \approx n-p\). (This rule of thumb is checked on the voting data below.)
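Before turning to a real data set, here is a minimal numerical check (on simulated data, assumed purely for illustration) of the identities above: the residual vector equals \((\mathbf{I}-\mathbf{H})\mathbf{y}\), \((\mathbf{I}-\mathbf{H})\mathbf{X}=\mathbf{0}\), and the residual sum of squares equals \(\mathbf{y}'(\mathbf{I}-\mathbf{H})\mathbf{y}\).

# Numerical check of the hat-matrix identities on simulated data
set.seed(200)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix, p = 3
y <- drop(X %*% c(1, 2, -1) + rnorm(n))

b <- solve(t(X) %*% X, t(X) %*% y)           # ML / least squares estimate
H <- X %*% solve(t(X) %*% X) %*% t(X)        # hat matrix
IH <- diag(n) - H

max(abs((y - X %*% b) - IH %*% y))           # ~ 0: residuals = (I - H) y
max(abs(IH %*% X))                           # ~ 0: (I - H) X = 0
c(sum((y - X %*% b)^2), drop(t(y) %*% IH %*% y))  # RSS = y'(I - H) y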

3.1 Voting data example

gavote <- gavote %>%
  mutate(undercount = (ballots - votes) / ballots) %>%
  rename(usage = rural) %>%
  mutate(pergore = gore / votes, 
         perbush = bush / votes)

lmod <- lm(undercount ~ pergore + perAA, gavote)
sumlmod <- summary(lmod)
  1. Scaled deviance, i.e., the output of deviance(), is \(\sigma^2D\)
RSS <- sum(residuals(lmod)^2)   # residual sum of squares
RSS
[1] 0.09324918
deviance(lmod)
[1] 0.09324918
  2. \(\widehat{\sigma}\), i.e., the residual standard error from the lm fit
sigma <- sqrt(RSS / df.residual(lmod))
sigma
[1] 0.02444895
sumlmod$sigma
[1] 0.02444895
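Per the rule of thumb above, the sum of squared standardized residuals should be roughly \(n-p\); a quick check using R’s rstandard():

sum(rstandard(lmod)^2)    # sum of squared standardized residuals
df.residual(lmod)         # n - p; the two should be comparable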

4 Example: Deviance for a Binomial model

  • If the responses \(Y_1, \ldots, Y_n\) are independent and \(Y_i\sim \text{Bin}(m_i, p_i)\), then the log-likelihood is \[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i\log p_i - y_i\log(1-p_i) + m_i\log(1-p_i) + \log {m_i\choose y_i}\right] \]

  • For the saturated model, the \(y_i\)’s are independent and each \(p_i\in (0,1)\) is unrestricted. There are \(n\) parameters, i.e., \(\boldsymbol{\beta}_{max} = (p_1,\ldots, p_n)'\), and the MLE is \(\mathbf{b}_{max} = (y_1/m_1, \ldots, y_n/m_n)'\), i.e., \(\widehat{p}_i = y_i/m_i\).

  • For a proposed logistic model with \(q\) predictors, the \(y_i\)’s are independent and \[ p_i = \frac{e^{\mathbf{x}_i'\boldsymbol{\beta}}}{1+e^{\mathbf{x}_i'\boldsymbol{\beta}}} \] Therefore there are \(q+1\) parameters in total (including the intercept), and the MLE of \(p_i\) is \[ \hat p_i = \frac{e^{\mathbf{x}_i'\mathbf{b}}}{1+e^{\mathbf{x}_i'\mathbf{b}}} \]

  • So the deviance is
    \[\begin{eqnarray*} D &=& 2 \sum_i \left[y_i \log(y_i/m_i) + (m_i - y_i) \log(1 - y_i / m_i)\right] \\ & & - 2 \sum_i \left[y_i \log(\widehat{p}_i) + (m_i - y_i) \log(1 - \widehat{p}_i)\right] \\ &=& 2 \sum_i \left[y_i \log(y_i / \widehat{y}_i) + (m_i - y_i) \log\frac{m_i - y_i}{m_i - \widehat{y}_i}\right], \end{eqnarray*}\] where \(\widehat{y}_i = m_i \widehat{p}_i\) are the fitted values from the model.

  • When \(Y\) is truly binomial and the \(m_i\) are relatively large, the deviance \(D\) is approximately \(\chi_{n-q-1}^2\) if the model is correct. A rule of thumb for using this asymptotic approximation is \(m_i \ge 5\). A large p-value indicates that the model has an adequate fit.

  • We can define the Pearson residuals as \[ r_i^{\text{P}} = \frac{y_i - m_i \widehat{p}_i}{\sqrt{\operatorname{Var} \widehat{y}_i}}, \] where \(\operatorname{Var} \widehat{y}_i = m_i \widehat{p}_i (1 - \widehat{p}_i)\). Then \[ X^2 = \sum_i \left(r_i^{\text{P}} \right)^2 \] in analogy to \(\text{RSS} = \sum_i r_i^2\) in linear regression.

  • Or we can define deviance residuals, similarly to binary outcome models, \[ r_i^{\text{D}} = \text{sign}(y_i - m_i\widehat{p}_i)\sqrt{2\left[y_i\log\left(\frac{y_i}{m_i\widehat{p}_i}\right) + (m_i-y_i)\log\left(\frac{m_i-y_i}{m_i-m_i\widehat{p}_i}\right)\right]} \] so that \(D = \sum_i \left(r_i^{\text{D}}\right)^2\). (Both kinds of residuals are recomputed by hand at the end of the data example below.)

4.1 Data example

binmod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = orings)
sumbinmod  <- summary(binmod)

sumbinmod

Call:
glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial, 
    data = orings)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 11.66299    3.29626   3.538 0.000403 ***
temp        -0.21623    0.05318  -4.066 4.78e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 38.898  on 22  degrees of freedom
Residual deviance: 16.912  on 21  degrees of freedom
AIC: 33.675

Number of Fisher Scoring iterations: 6
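Each launch involves \(m_i = 6\) O-rings, so the \(m_i \ge 5\) rule of thumb for the \(\chi^2\) approximation is met. The residual deviance of 16.912 on 21 degrees of freedom can be referred to \(\chi^2_{21}\) for a goodness-of-fit p-value:

# goodness-of-fit: compare residual deviance to chi-square(n - q - 1)
pchisq(deviance(binmod), df.residual(binmod), lower.tail = FALSE)

Since the deviance is below its null expectation of 21, this p-value is large, indicating no evidence of lack of fit.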
tibble(rpearson = residuals(binmod, type = "pearson"),
  rdeviance = residuals(binmod, type = "deviance")) %>%
  ggplot() +
  geom_point(mapping = aes(x = rpearson, y = rdeviance)) + 
  geom_abline(intercept = 0, slope = 1) + 
  xlab("Pearson residual") + 
  ylab("Deviance residual")

5 Hypothesis Testing