Display system information and load tidyverse and faraway packages
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.4.1 fastmap_1.2.0 cli_3.6.4
[5] tools_4.4.1 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.49 jsonlite_1.9.1 xfun_0.51
[13] digest_0.6.37 rlang_1.1.5 evaluate_1.0.3
library(tidyverse)
library(faraway)
1 Log-Likelihood ratio
A saturated model
If there are \(n\) observations \(Y_i\), \(i=1,\ldots, n\), all with potentially different values for the linear component \(X_i'\boldsymbol{\beta}\), then we can specify a model with \(n\) parameters to let the predicted \(Y_i\) be the same as observed \(Y_i\).
If some of the observations correspond to the same linear combinations of the predictors, the number of parameters in the saturated model can be less than \(n\).
We use \(m\) to denote the number of parameters that can be estimated in a saturated model.
Let \(\boldsymbol{\beta}_{max}\) be the parameter vector of the saturated model.
Let \(\mathbf{b}_{max}\) be the MLE from the saturated model and \(L(\mathbf{b}_{max})\) be the likelihood evaluated at \(\mathbf{b}_{max}\); it is larger than the maximized likelihood of any other model fitted to the same data.
Let \(L(\mathbf{b})\) be the maximum value of the likelihood function for the model of interest.
Then the likelihood ratio \[
\lambda = \frac{L(\mathbf{b}_{max})}{L(\mathbf{b})}
\] provides a way of assessing the goodness of fit of the model.
In terms of log-likelihood \[
\log \lambda = \ell(\mathbf{b}_{max})- \ell(\mathbf{b})
\]
2 Deviance
The deviance, also called the likelihood ratio statistic, is \[
D = 2\log\lambda = 2\left[\ell(\mathbf{b}_{max}) - \ell(\mathbf{b})\right]
\]
Using a Taylor expansion, one can show that the sampling distribution of the deviance is approximately \[
D\sim \chi^2(m-p, \nu)
\] where \(p\) is the number of parameters in the model of interest and \(\nu\) is a non-centrality parameter, which is near zero when the fitted model describes the data almost as well as the saturated model. So for a well-fitting model, \(D\) is approximately \(\chi^2(m-p)\).
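As a quick numerical illustration (a minimal sketch with simulated grouped binomial data; the objects x, y, m, and fit below are made up for this example, not from the lecture), we can compute \(\ell(\mathbf{b}_{max})\) directly, form \(D = 2\log\lambda\), and check that it matches what glm() reports:
# Simulated grouped binomial data (illustrative only)
set.seed(1)
n <- 20
m <- rep(10, n)
x <- runif(n)
p <- plogis(-1 + 2 * x)
y <- rbinom(n, size = m, prob = p)
fit <- glm(cbind(y, m - y) ~ x, family = binomial)
# Saturated model: the fitted probability for observation i is y_i / m_i
l_sat <- sum(dbinom(y, size = m, prob = y / m, log = TRUE))
l_fit <- as.numeric(logLik(fit))
D <- 2 * (l_sat - l_fit)                    # D = 2 log(lambda)
c(D, deviance(fit))                         # the two should agree
pchisq(D, df = n - 2, lower.tail = FALSE)   # approximate goodness-of-fit p-value
Note that the binomial coefficient terms appear in both log-likelihoods, so they cancel in the difference.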
3 Example: Deviance for a Normal linear model
In linear models with the normality assumption, for \(i=1,\ldots, n\), \[
\mbox{E}(Y_i) = \mu_i = \mathbf{x}_i'\boldsymbol{\beta}; \quad Y_i \sim N(\mu_i, \sigma^2)
\] where the \(Y_i\) are independent. The log-likelihood function is \[
\ell(\boldsymbol{\beta}) = -\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu_i)^2-\frac{n}{2}\log(2\pi\sigma^2)
\]
For the saturated model, the predicted value of each \(Y_i\) is \(y_i\) itself, so the sum-of-squares term vanishes and \[
\ell(\mathbf{b}_{max}) = -\frac{n}{2}\log(2\pi \sigma^2)
\]
For a proposed model with \(p<n\) parameters, the maximum likelihood estimator is \[
\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
\]
The corresponding maximum value of log-likelihood is \[
\ell(\mathbf{b}) = -\frac{n}{2}\log(2\pi\sigma^2) -\sum_{i=1}^n \frac{1}{2\sigma^2}(y_i -\mathbf{X}_i' \mathbf{b})^2.
\]
Therefore the deviance is \[
D = 2(\ell(\mathbf{b}_{max}) - \ell(\mathbf{b})) = \sum_{i=1}^n \frac{1}{\sigma^2}(y_i -\mathbf{X}_i' \mathbf{b})^2 = \frac{1}{\sigma^2} (\mathbf{y}-\mathbf{X}\mathbf{b})'(\mathbf{y}-\mathbf{X}\mathbf{b})
\] And we know \[
\mathbf{y}-\mathbf{X}\mathbf{b} = [\mathbf{I}-\mathbf{H}]\mathbf{y}
\] where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the hat matrix, and \[
(\mathbf{y}-\mathbf{X}\mathbf{b})'(\mathbf{y}-\mathbf{X}\mathbf{b}) = \mathbf{y}'(\mathbf{I}-\mathbf{H})\mathbf{y}
\] Since the rank of \(\mathbf{I}-\mathbf{H}\) is \(n-p\), \(D\) has a chi-squared distribution with \(n-p\) degrees of freedom and non-centrality parameter \((\mathbf{X}\boldsymbol{\beta})'(\mathbf{I}-\mathbf{H})(\mathbf{X}\boldsymbol{\beta})/\sigma^2\). But \((\mathbf{I}-\mathbf{H})\mathbf{X} = 0\), therefore \(D\) has the central distribution \(\chi^2(n-p)\) exactly.
The term scaled deviance is sometimes used for \[
\sigma^2D = \sum_{i=1}^n(y_i-\widehat{\mu}_i)^2
\] where \(\widehat{\mu}_i= \mathbf{X}_i' \mathbf{b}\). As shown above, \(D\sim \chi^2(n-p)\), and since the expected value of a \(\chi^2(n-p)\) random variable is \(n-p\), the expected value of \(D\) is \(n-p\).
This provides an estimate of \(\sigma^2\) as \[
\widehat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i-\widehat{\mu}_i)^2}{n-p}
\] or, in our lecture notes, we said: "An unbiased estimate of the error variance \(\sigma^2\) is" \[
\widehat{\sigma} = \sqrt{\frac{\text{RSS}}{\text{df}}}
\] Some programs output the scaled deviance for a Normal linear model and call \(\widehat{\sigma}^2\) the scale parameter.
The deviance is also related to the sum of squares of the standardized residuals \[
\sum_{i=1}^n r_i^2 = \frac{1}{\widehat{\sigma}^2}\sum_{i=1}^n(y_i-\widehat{\mu}_i)^2
\] where \(\widehat{\sigma}^2\) is an estimate of \(\sigma^2\). This provides a rough rule of thumb for the overall magnitude of the standardized residuals. If the model fits well, so that \(D\sim \chi^2(n-p)\), you would expect \(\sum_{i=1}^n r_i^2 \approx n-p\).
The scaled deviance, i.e., the output of deviance(), is \(\sigma^2D\):
RSS <- sum(residuals(sumlmod)^2)
RSS
[1] 0.09324918
deviance(lmod)
[1] 0.09324918
\(\widehat{\sigma}\), i.e., the Residual standard error from the lm fit:
sigma <- sqrt(RSS / df.residual(lmod))
sigma
[1] 0.02444895
sumlmod$sigma
[1] 0.02444895
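Relatedly, the rule of thumb above can be checked on the same fit (a small sketch reusing lmod from earlier in the notes; note that R's rstandard() also divides by \(\sqrt{1-h_{ii}}\), so the sum is only approximately \(n-p\)):
# Rule of thumb: the squared standardized residuals should sum to roughly n - p
sum(rstandard(lmod)^2)
df.residual(lmod)   # n - p, for comparison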
4 Example: Deviance for a Binomial model
If the responses \(Y_1, \ldots, Y_n\) are independent and \(Y_i\sim \text{Bin}(m_i, p_i)\), then the log-likelihood is \[
\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i\log p_i - y_i\log(1-p_i) + m_i\log(1-p_i) + \log {m_i\choose y_i}\right]
\]
For the saturated model, the \(y_i\)'s are independent and each \(p_i\in (0,1)\) is unrestricted. There are \(n\) parameters, i.e., \(\boldsymbol{\beta} = (p_1,\ldots, p_n)'\), and the MLE \(\mathbf{b}_{max}\) has components \(y_i/m_i\).
For a proposed logistic model, the \(y_i\)'s are independent and \[
p_i = \frac{e^{\mathbf{x_i}'\boldsymbol{\beta}}}{1+e^{\mathbf{x_i}'\boldsymbol{\beta}}}
\] so with \(q\) predictors plus an intercept there are \(q+1\) parameters in total. The MLE of \(p_i\) is \[
\hat p_i = \frac{e^{\mathbf{x_i}'\mathbf{b}}}{1+e^{\mathbf{x_i}'\mathbf{b}}}
\] where \(\mathbf{b}\) is the MLE of \(\boldsymbol{\beta}\).
So the deviance is \[\begin{eqnarray*}
D &=& 2 \sum_i \left[ y_i \log(y_i/m_i) + (m_i - y_i) \log(1 - y_i / m_i) \right] \\
& & - 2 \sum_i \left[ y_i \log(\widehat{p}_i) + (m_i - y_i) \log(1 - \widehat{p}_i) \right] \\
&=& 2 \sum_i \left[ y_i \log(y_i / \widehat{y}_i) + (m_i - y_i) \log\left(\frac{m_i - y_i}{m_i - \widehat{y}_i}\right) \right],
\end{eqnarray*}\] where \(\widehat{y}_i = m_i \widehat{p}_i\) are the fitted values from the model.
When \(Y\) is truly binomial and the \(m_i\) are relatively large, the deviance \(D\) is approximately \(\chi_{n-q-1}^2\) if the model is correct. A rule of thumb for using this asymptotic approximation is \(m_i \ge 5\). A large p-value from comparing \(D\) to \(\chi_{n-q-1}^2\) indicates that the model fits adequately.
We can define the Pearson residuals as \[
r_i^{\text{P}} = \frac{y_i - m_i \widehat{p}_i}{\sqrt{\operatorname{Var} \widehat{y}_i}} = \frac{y_i - m_i \widehat{p}_i}{\sqrt{m_i \widehat{p}_i (1 - \widehat{p}_i)}}.
\] Then the Pearson statistic \[
X^2 = \sum_i \left(r_i^{\text{P}} \right)^2
\] is analogous to \(\text{RSS} = \sum_i r_i^2\) in linear regression.
Similarly to binary-outcome models, we can define the deviance residual \[
r_i^{\text{D}} = \text{sign}(y_i - m_i\widehat{p}_i)\sqrt{2\left[y_i\ln\left(\frac{y_i}{m_i\widehat{p}_i}\right) + (m_i-y_i)\ln\left(\frac{m_i-y_i}{m_i-m_i\widehat{p}_i}\right)\right]}
\] so that \(\sum_i \left(r_i^{\text{D}}\right)^2 = D\).
4.1 Data example
binmod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = orings)
sumbinmod <- summary(binmod)
sumbinmod
Call:
glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial,
data = orings)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.66299 3.29626 3.538 0.000403 ***
temp -0.21623 0.05318 -4.066 4.78e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 38.898 on 22 degrees of freedom
Residual deviance: 16.912 on 21 degrees of freedom
AIC: 33.675
Number of Fisher Scoring iterations: 6
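As a sanity check on the formulas in this section (a minimal sketch using the binmod fit just obtained; here each flight has \(m_i = 6\) O-rings), the residual deviance reported above can be reconstructed by hand and compared to \(\chi^2_{n-q-1}\) for a goodness-of-fit p-value:
# Reconstruct D = 2 * sum[ y log(y / yhat) + (m - y) log((m - y) / (m - yhat)) ]
y <- orings$damage
m <- rep(6, nrow(orings))
yhat <- m * fitted(binmod)   # fitted counts m_i * phat_i
# Terms with y = 0 contribute 0; guard against 0 * log(0)
term1 <- ifelse(y == 0, 0, y * log(y / yhat))
term2 <- ifelse(m - y == 0, 0, (m - y) * log((m - y) / (m - yhat)))
D <- 2 * sum(term1 + term2)
c(D, deviance(binmod))   # both should equal the residual deviance 16.912
pchisq(D, df = df.residual(binmod), lower.tail = FALSE)   # goodness-of-fit p-value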
tibble(rpearson = residuals(binmod, type = "pearson"),
       rdeviance = residuals(binmod, type = "deviance")) %>%
  ggplot() +
  geom_point(mapping = aes(x = rpearson, y = rdeviance)) +
  geom_abline(intercept = 0, slope = 1) +
  xlab("Pearson residual") +
  ylab("Deviance residual")
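Finally, a short check (continuing with binmod) that the two residual types tie back to the statistics above: the squared deviance residuals sum to the residual deviance \(D\), and the squared Pearson residuals sum to the Pearson statistic \(X^2\):
# Sum of squared deviance residuals reproduces the residual deviance
sum(residuals(binmod, type = "deviance")^2)
deviance(binmod)
# Sum of squared Pearson residuals gives the Pearson X^2 statistic
sum(residuals(binmod, type = "pearson")^2)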