Display system information and load the tidyverse and faraway packages
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 evaluate_0.21
## [5] cachem_1.0.8 rlang_1.1.1 cli_3.6.1 rstudioapi_0.14
## [9] jquerylib_0.1.4 bslib_0.4.2 rmarkdown_2.21 tools_4.2.3
## [13] xfun_0.39 yaml_2.3.7 fastmap_1.1.1 compiler_4.2.3
## [17] htmltools_0.5.5 knitr_1.42 sass_0.4.6
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(faraway)
A saturated model
Let \(L(\mathbf{b})\) be the maximum value of the likelihood function for the model of interest, and let \(L(\mathbf{b}_{max})\) be the maximum value attainable, which is achieved by the saturated model.
Then the likelihood ratio \[ \lambda = \frac{L(\mathbf{b}_{max})}{L(\mathbf{b})} \] provides a way of assessing the goodness of fit of the model.
In terms of log-likelihoods, \[ \log \lambda = \ell(\mathbf{b}_{max})- \ell(\mathbf{b}) \]
Twice this difference is called the deviance: \[ D = 2 \log \frac{L (\mathbf{b}_{max})}{L(\mathbf{b})} = 2[\ell(\mathbf{b}_{max})- \ell(\mathbf{b})] \]
In linear models under the normality assumption, for \(i=1,\ldots, n\), \[ \mbox{E}(Y_i) = \mu_i = \mathbf{x}_i'\boldsymbol{\beta}; \quad Y_i \sim N(\mu_i, \sigma^2) \] where the \(Y_i\) are independent. The log-likelihood function is \[ \ell(\boldsymbol{\beta}) = -\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu_i)^2-\frac{n}{2}\log(2\pi\sigma^2) \]
For the saturated model, we use \(y_i\) itself as the fitted value of \(\mu_i\), so the sum-of-squares term vanishes and \[ \ell(\mathbf{b}_{max}) = -\frac{n}{2}\log(2\pi \sigma^2) \]
For a proposed model with \(p<n\) parameters, the maximum likelihood estimator is \[ \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \]
Therefore the deviance is \[ D = 2[\ell(\mathbf{b}_{max}) - \ell(\mathbf{b})] = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i -\mathbf{x}_i' \mathbf{b})^2 = \frac{1}{\sigma^2} (\mathbf{y}-\mathbf{X}\mathbf{b})'(\mathbf{y}-\mathbf{X}\mathbf{b}) \] And we know \[ \mathbf{y}-\mathbf{X}\mathbf{b} = (\mathbf{I}-\mathbf{H})\mathbf{y} \] where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the hat matrix, and \[ (\mathbf{y}-\mathbf{X}\mathbf{b})'(\mathbf{y}-\mathbf{X}\mathbf{b}) = \mathbf{y}'(\mathbf{I}-\mathbf{H})\mathbf{y} \] Since the rank of \(\mathbf{I}-\mathbf{H}\) is \(n-p\), \(D\) has a chi-squared distribution with \(n-p\) degrees of freedom and non-centrality parameter \((\mathbf{X}\boldsymbol{\beta})'(\mathbf{I}-\mathbf{H})(\mathbf{X}\boldsymbol{\beta})/\sigma^2\). But \((\mathbf{I}-\mathbf{H})\mathbf{X} = \mathbf{0}\), so \(D\) has the central \(\chi^2(n-p)\) distribution exactly.
The quantity \[ \sigma^2D = \sum_{i=1}^n(y_i-\widehat{\mu}_i)^2 \] where \(\widehat{\mu}_i= \mathbf{x}_i' \mathbf{b}\), is the residual sum of squares; \(D\) itself, with the dispersion divided out, is what is sometimes called the scaled deviance. So if the model fits the data well, then \(D\sim \chi^2(n-p)\). The expected value of a random variable with the \(\chi^2(n-p)\) distribution is \(n-p\), so the expected value of \(D\) is \(n-p\).
This provides an estimate of \(\sigma^2\): \[ \widehat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i-\widehat{\mu}_i)^2}{n-p} \] or, as our lecture notes put it, “an unbiased estimate of the error variance \(\sigma^2\) is” \(\text{RSS}/\text{df}\), giving \[ \widehat{\sigma} = \sqrt{\frac{\text{RSS}}{\text{df}}} \] Some programs output the scaled deviance for a Normal linear model and call \(\widehat{\sigma}^2\) the scale parameter.
The deviance is also related to the sum of squares of the standardized residuals: \[ \sum_{i=1}^n r_i^2 = \frac{1}{\widehat{\sigma}^2}\sum_{i=1}^n(y_i-\widehat{\mu}_i)^2 \] where \(\widehat{\sigma}^2\) is an estimate of \(\sigma^2\). This provides a rough rule of thumb for the overall magnitude of the standardized residuals: if the model fits well, so that \(D\sim \chi^2(n-p)\), you would expect \(\sum_{i=1}^n r_i^2 \approx n-p\). We illustrate with the gavote election data from the faraway package.
# Derive the undercount and the Gore/Bush vote shares
gavote <- gavote %>%
  mutate(undercount = (ballots - votes) / ballots) %>%
  rename(usage = rural) %>%
  mutate(pergore = gore / votes,
         perbush = bush / votes)
lmod <- lm(undercount ~ pergore + perAA, gavote)
sumlmod <- summary(lmod)
The deviance of a linear model is \(\sigma^2 D\), the residual sum of squares:
RSS <- sum(residuals(sumlmod)^2)
RSS
## [1] 0.09324918
deviance(lmod)
## [1] 0.09324918
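As a sanity check on the derivation above, we can form the hat matrix explicitly and confirm that \(\mathbf{y}'(\mathbf{I}-\mathbf{H})\mathbf{y}\) reproduces the RSS. A minimal sketch; the names X, H, and y are ours:
X <- model.matrix(lmod)
H <- X %*% solve(t(X) %*% X) %*% t(X)        # hat matrix
y <- gavote$undercount
drop(t(y) %*% (diag(nrow(X)) - H) %*% y)     # same value as RSS above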
We can recover \(\widehat{\sigma}\) from the lm fit:
sigma <- sqrt(RSS / df.residual(lmod))
sigma
## [1] 0.02444895
sumlmod$sigma
## [1] 0.02444895
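Echoing the rule of thumb above, the sum of squared standardized residuals should be close to \(n-p\). A quick sketch using base R's rstandard(), which standardizes each residual by \(\widehat{\sigma}\sqrt{1-h_{ii}}\):
sum(rstandard(lmod)^2)  # roughly n - p if the model fits
df.residual(lmod)       # n - p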
Now consider binomial responses: if \(Y_1, \ldots, Y_n\) are independent with \(Y_i\sim \text{Bin}(m_i, p_i)\), then the log-likelihood is \[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i\log p_i - y_i\log(1-p_i) + m_i\log(1-p_i) + \log {m_i\choose y_i}\right] \]
For the saturated model, the \(y_i\)’s are independent and each \(p_i\in (0,1)\) is unrestricted, so there are \(n\) parameters, i.e., \(\boldsymbol{\beta} = (p_1,\ldots, p_n)'\). The MLEs are \(\widehat{p}_i = y_i/m_i\), so \(\mathbf{b}_{max} = (y_1/m_1,\ldots, y_n/m_n)'\).
For a proposed logistic model, the \(y_i\)’s are independent and \[ p_i = \frac{e^{\mathbf{x}_i'\boldsymbol{\beta}}}{1+e^{\mathbf{x}_i'\boldsymbol{\beta}}} \] so there are \(q+1\) parameters in total (an intercept plus \(q\) predictors). The MLE of each \(p_i\) is \[ \widehat{p}_i = \frac{e^{\mathbf{x}_i'\mathbf{b}}}{1+e^{\mathbf{x}_i'\mathbf{b}}} \]
So the deviance is
\[\begin{eqnarray*}
D &=& 2 \sum_i \left[ y_i \log(y_i/m_i) + (m_i - y_i) \log(1 - y_i/m_i) \right] \\
& & {} - 2 \sum_i \left[ y_i \log \widehat{p}_i + (m_i - y_i) \log(1 - \widehat{p}_i) \right] \\
&=& 2 \sum_i \left[ y_i \log \frac{y_i}{\widehat{y}_i} + (m_i - y_i) \log \frac{m_i - y_i}{m_i - \widehat{y}_i} \right],
\end{eqnarray*}\]
where \(\widehat{y}_i = m_i \widehat{p}_i\) are the fitted values from the model.
When the \(Y_i\) are truly binomial and the \(m_i\) are relatively large, the deviance \(D\) is approximately \(\chi_{n-q-1}^2\) if the model is correct. A rule of thumb for using this asymptotic approximation is \(m_i \ge 5\). A large p-value from comparing \(D\) to \(\chi_{n-q-1}^2\) indicates that the model fits adequately.
We can define the Pearson residuals as \[ r_i^{\text{P}} = \frac{y_i - m_i \widehat{p}_i}{\sqrt{m_i \widehat{p}_i (1-\widehat{p}_i)}}, \] where the denominator is the estimated standard deviation of \(Y_i\). Then \[ X^2 = \sum_i \left(r_i^{\text{P}} \right)^2 \] in analogy to \(\text{RSS} = \sum_i r_i^2\) in linear regression.
Similarly, we can define deviance residuals, as in binary outcome models: \[ r_i^{\text{D}} = \text{sign}(y_i - m_i\widehat{p}_i)\sqrt{2\left[y_i\ln\left(\frac{y_i}{m_i\widehat{p}_i}\right) + (m_i-y_i)\ln\left(\frac{m_i-y_i}{m_i-m_i\widehat{p}_i}\right)\right]} \] so that \(\sum_i (r_i^{\text{D}})^2 = D\). We illustrate with the O-ring damage data (orings) from the faraway package.
binmod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = orings)
sumbinmod <- summary(binmod)
sumbinmod
##
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial,
## data = orings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9529 -0.7345 -0.4393 -0.2079 1.9565
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.66299 3.29626 3.538 0.000403 ***
## temp -0.21623 0.05318 -4.066 4.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38.898 on 22 degrees of freedom
## Residual deviance: 16.912 on 21 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 6
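As a check, we can compute the residual deviance by hand from the formula above, using the convention \(0 \log 0 = 0\) for groups with \(y_i = 0\). A minimal sketch; the helper dterm and the other names are ours:
y <- orings$damage
m <- 6                                     # six O-rings per launch
yhat <- m * fitted(binmod)                 # fitted counts m_i * p-hat_i
dterm <- function(a, b) ifelse(a == 0, 0, a * log(a / b))
2 * sum(dterm(y, yhat) + dterm(m - y, m - yhat))  # should match the residual deviance above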
Plotting the Pearson residuals against the deviance residuals, with the 45-degree line for reference:
tibble(rpearson = residuals(binmod, type = "pearson"),
       rdeviance = residuals(binmod, type = "deviance")) %>%
  ggplot() +
  geom_point(mapping = aes(x = rpearson, y = rdeviance)) +
  geom_abline(intercept = 0, slope = 1) +
  xlab("Pearson residual") +
  ylab("Deviance residual")