Display system information and load tidyverse and faraway packages
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.4.1 fastmap_1.2.0 cli_3.6.4
[5] tools_4.4.1 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.49 jsonlite_1.9.1 xfun_0.51
[13] digest_0.6.37 rlang_1.1.5 evaluate_1.0.3
library(tidyverse)library(faraway)
faraway package contains the datasets in the ELMR book.
1 Galápagos data
The gala data set records the number of plant species and the number of endemic species for 30 Galápagos islands. We are interested in modeling the number of species using a regression.
gala <-as_tibble(gala, rownames ="Island") %>%print(n =Inf)
gala %>%ggplot() +geom_histogram(mapping =aes(x = Species)) +labs(title ="Species on Galápagos Islands")
2 Poisson regession
Assume the count response \(Y_i\) follows a Poisson\((\mu_i)\) distribution with density function \[
\mathbb{P}(Y_i = y_i) = e^{-\mu_i} \frac{\mu_i^{y_i}}{y_i!}.
\]
The Poisson parameter \(\mu_i\) is related to the predictors via an inverse link function\[
\mu_i = e^{\eta_i},
\] where \(\eta_i\) is the linear predictor or systematic component\[
\eta_i = \mathbf{x}_i^T \boldsymbol{\beta}.
\]
The function \[
\eta_i = g(\mu_i) = \log \mu_i
\] that links \(\mathbb{E} Y_i = \mu_i\) to the systematic component is called the link function. This particular link function is called the log link function. The Poisson regression model with log link is also called a log-linear model.
Given the \(n\) independent data points \((y_i, \mathbf{x}_i)\), \(i=1,\ldots,n\), the log-likelihood is \[\begin{eqnarray*}
\ell(\boldsymbol{\beta}) &=& \sum_i y_i \log \mu_i - \mu_i - \log y_i! \\
&=& \sum_i y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} - \log y_i!
\end{eqnarray*}\]
We maximize the log-likelihood function to find the MLE of regression coefficient \(\boldsymbol{\beta}\).
modp <-glm(Species ~ . - Endemics - Island, family = poisson, data = gala)summary(modp)
Call:
glm(formula = Species ~ . - Endemics - Island, family = poisson,
data = gala)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 716.85 on 24 degrees of freedom
AIC: 889.68
Number of Fisher Scoring iterations: 5
3 Interpretation
Regression coefficients reflect the unit change in the log of Poisson mean.
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
4 Goodness of fit
The deviance for the Poisson regression is \[\begin{eqnarray*}
D &=& 2 \sum_i [y_i \log(y_i) - y_i] - 2 \sum_i [y_i \log (\widehat{\mu}_i) - \widehat{\mu}_i] \\
&=& 2 \sum_i [y_i \log(y_i / \widehat{\mu}_i) - (y_i - \widehat{\mu}_i)],
\end{eqnarray*}\] where \(\widehat{\mu}_i\) are the fitted values from the model. The Poisson deviance is also called the G-statistic.
For the Galápagos example, comparing the deviance \(D\) to \(\chi_{n - p}^2\) gives an extremely small p-value (why?), indicating a lack of fit.
An alternative goodness of fit test compares the Pearson \(X^2\)\[
X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \sum_i \frac{(y_i - \widehat{\mu}_i)^2}{\widehat{\mu}_i}
\] to the asymptotic distribution \(\chi_{n - p}^2\). Again it indicates serious lack of fit.
# predmu = predict(modp, type = "response")# sum((gala$Species - predmu)^2 / predmu)(px2 <-sum(residuals(modp, type ="pearson")^2))
[1] 761.9792
5 Diagnostics
Plot deviance residuals against linear predictor. We don’t see outliers.
gala %>%mutate(devres =residuals(modp, type ="deviance"), linpred =predict(modp, type ="link")) %>% ggplot +geom_point(mapping =aes(x = linpred, y = devres)) +labs(x ="Linear predictor", y ="Deviance residual")
Plot Pearson residuals against linear predictor. We don’t see outliers.
gala %>%mutate(perres =residuals(modp, type ="pearson"),linpred =predict(modp, type ="link")) %>% ggplot +geom_point(mapping =aes(x = linpred, y = perres)) +labs(x ="Linear predictor", y ="Pearson residual")
Any high leverage points? Fernandina and Isabela islands have high elevation and relatively large area.
halfnorm(hatvalues(modp))
gala %>%slice(c(12, 16))
# A tibble: 2 × 8
Island Species Endemics Area Elevation Nearest Scruz Adjacent
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fernandina 93 35 634. 1494 4.3 95.3 4669.
2 Isabela 347 89 4669. 1707 0.7 28.1 634.
Any high influential points?
halfnorm(cooks.distance((modp)))
For Poisson distribution, \(\mathbb{E}Y_i = \operatorname{Var}Y_i = \mu_i\). Let’s check this assumption. We use \((y_i - \widehat{\mu}_i)^2\) as a crude estimate of \(\operatorname{Var}Y_i\). Majority of points are above the 45 degree line. Looks like we have overdispersion.
gala %>%mutate(predmu =predict(modp, type ="response"), res = Species - predmu) %>%ggplot() +geom_point(mapping =aes(x = predmu, y = res^2)) +geom_abline(intercept =0, slope =1) +labs(x =expression(hat(mu)), y =expression((y -hat(mu))^2)) +scale_x_log10() +scale_y_log10()
6 Overdispersion
In overdispersion model, we assume an overdispersion parameter\(\phi\) such that \[
\operatorname{Var} Y_i = \phi \mu_i.
\]
Given MLE, the overdispersion parameter is estimated by \[
\widehat{\phi} = \frac{X^2}{n - p}.
\]
(dp <- px2 / modp$df.residual)
[1] 31.74914
The inference on regression coefficient is changed with overdispersion parameter. We see predictors Area, Elevation and Adjacent are significant, but not Nearest and Scruz.
summary(modp, dispersion = dp)
Call:
glm(formula = Species ~ . - Endemics - Island, family = poisson,
data = gala)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.1548079 0.2915897 10.819 < 2e-16 ***
Area -0.0005799 0.0001480 -3.918 8.95e-05 ***
Elevation 0.0035406 0.0004925 7.189 6.53e-13 ***
Nearest 0.0088256 0.0102621 0.860 0.390
Scruz -0.0057094 0.0035251 -1.620 0.105
Adjacent -0.0006630 0.0001653 -4.012 6.01e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 31.74914)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 716.85 on 24 degrees of freedom
AIC: 889.68
Number of Fisher Scoring iterations: 5
drop1(modp, scale = dp, test ="F")
Warning in drop1.glm(modp, scale = dp, test = "F"): F test assumes
'quasipoisson' family
Quasi-Poisson assumes \[
\mathbb{E}Y_i = \mu_i, \quad \operatorname{Var}Y_i = \phi \mu_i
\] only and yields similar inference.
modd <-glm(Species ~ . - Endemics - Island, family = quasipoisson, data = gala)summary(modd)
Call:
glm(formula = Species ~ . - Endemics - Island, family = quasipoisson,
data = gala)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1548079 0.2915901 10.819 1.03e-10 ***
Area -0.0005799 0.0001480 -3.918 0.000649 ***
Elevation 0.0035406 0.0004925 7.189 1.98e-07 ***
Nearest 0.0088256 0.0102622 0.860 0.398292
Scruz -0.0057094 0.0035251 -1.620 0.118380
Adjacent -0.0006630 0.0001653 -4.012 0.000511 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 31.74921)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 716.85 on 24 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
drop1(modd, test ="F")
Single term deletions
Model:
Species ~ (Island + Endemics + Area + Elevation + Nearest + Scruz +
Adjacent) - Endemics - Island
Df Deviance F value Pr(>F)
<none> 716.85
Area 1 1204.35 16.3217 0.0004762 ***
Elevation 1 2389.57 56.0028 1.007e-07 ***
Nearest 1 739.41 0.7555 0.3933572
Scruz 1 813.62 3.2400 0.0844448 .
Adjacent 1 1341.45 20.9119 0.0001230 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8 Negative bionomial regression
An alternative to Poisson model is the negative binomial distribution with probability mass function \[
\mathbb{P}(Y = y) = \binom{y + r - 1}{r - 1} (1 - p)^y p^r, \quad y = 0, 1, \ldots
\] HW3: show \(\mathbb{E}Y = \mu = rp / (1 - p)\) and \(\operatorname{Var} Y = r p / (1 - p)^2\).
We can relate mean response \(\mu\) to the predictors via log link function \[
\eta = \mathbf{x}^T \boldsymbol{\beta} = \log \mu.
\]
Pre-specify r=1 (\(\theta\) in MASS) and fit negative bionomial regression:
library(MASS)modn <-glm(Species ~ . - Endemics - Island, family =negative.binomial(1), data = gala)summary(modn)
Call:
glm(formula = Species ~ . - Endemics - Island, family = negative.binomial(1),
data = gala)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9040718 0.2726617 10.651 1.41e-10 ***
Area -0.0006338 0.0003143 -2.017 0.05506 .
Elevation 0.0038575 0.0007561 5.102 3.21e-05 ***
Nearest 0.0027834 0.0149036 0.187 0.85342
Scruz -0.0018568 0.0030569 -0.607 0.54929
Adjacent -0.0007617 0.0002492 -3.057 0.00542 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1) family taken to be 0.7232273)
Null deviance: 54.069 on 29 degrees of freedom
Residual deviance: 20.741 on 24 degrees of freedom
AIC: 305.75
Number of Fisher Scoring iterations: 16
Estimate parameter r (\(\theta\) in MASS) from data:
glm.nb(Species ~ . - Endemics - Island, data = gala) %>%summary()
Call:
glm.nb(formula = Species ~ . - Endemics - Island, data = gala,
init.theta = 1.674602286, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.9065247 0.2510344 11.578 < 2e-16 ***
Area -0.0006336 0.0002865 -2.211 0.027009 *
Elevation 0.0038551 0.0006916 5.574 2.49e-08 ***
Nearest 0.0028264 0.0136618 0.207 0.836100
Scruz -0.0018976 0.0028096 -0.675 0.499426
Adjacent -0.0007605 0.0002278 -3.338 0.000842 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.6746) family taken to be 1)
Null deviance: 88.431 on 29 degrees of freedom
Residual deviance: 33.196 on 24 degrees of freedom
AIC: 304.22
Number of Fisher Scoring iterations: 1
Theta: 1.675
Std. Err.: 0.442
2 x log-likelihood: -290.223
9 Zero-inflated count models
The bioChemists data set contains 915 biochemistry graduate students. The outcome of interest is the number of articles procuded during the last three years of the PhD.
# A tibble: 915 × 6
art fem mar kid5 phd ment
<int> <fct> <fct> <dbl> <dbl> <int>
1 0 Men Married 0 2.52 7
2 0 Women Single 0 2.05 6
3 0 Women Single 0 3.75 6
4 0 Men Married 1 1.18 3
5 0 Women Single 0 3.75 26
6 0 Women Married 2 3.59 2
7 0 Women Single 0 3.19 3
8 0 Men Married 2 2.96 4
9 0 Men Single 0 4.62 6
10 0 Women Married 0 1.25 0
# ℹ 905 more rows
Descriptive statistics.
bioChemists %>%ggplot() +geom_bar(mapping =aes(x = art, fill = fem)) +labs(x ="Articles", y ="Count", title ="Publications by PhD Students")
bioChemists %>%ggplot() +geom_bar(mapping =aes(x = art, fill = mar))
We first fit a Poisson regresison. A deviance of 1634.4 on 909 degrees of freedom indicates a lack of fit.
modp <-glm(art ~ ., family = poisson, data = bioChemists)summary(modp)
Call:
glm(formula = art ~ ., family = poisson, data = bioChemists)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.304617 0.102981 2.958 0.0031 **
femWomen -0.224594 0.054613 -4.112 3.92e-05 ***
marMarried 0.155243 0.061374 2.529 0.0114 *
kid5 -0.184883 0.040127 -4.607 4.08e-06 ***
phd 0.012823 0.026397 0.486 0.6271
ment 0.025543 0.002006 12.733 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1634.4 on 909 degrees of freedom
AIC: 3314.1
Number of Fisher Scoring iterations: 5
The observed against fitte values plot shows the prediction for 0 is quite off.
tibble(ocount =table(bioChemists$art)[1:8], # observed count for 0:7pcount =colSums(predprob(modp)[, 1:8]), # expected count for 0:7count =0:7) %>%ggplot(mapping =aes(x = pcount, y = ocount, label = count)) +geom_point() +geom_text(nudge_y =8) +labs(x ="Predicted", y ="Observed")
The hurdle model assumes \[\begin{eqnarray*}
\mathbb{P}(Y = 0) &=& f_1(0), \\
\mathbb{P}(Y = j) &=& \frac{1 - f_1(0)}{1 - f_2(0)} f_2(j), \quad j > 0.
\end{eqnarray*}\] We shall use a Bernoulli response model for \(f_1(1)\) to model the probability of getting over the hurdle 0, and use the zero-truncated Poisson model for \(f_2\) to model the probability of observing an outcome greater than 0.
library(gtsummary)modh <-hurdle(art ~ ., data = bioChemists)summary(modh)
Call:
hurdle(formula = art ~ ., data = bioChemists)
Pearson residuals:
Min 1Q Median 3Q Max
-2.4105 -0.8913 -0.2817 0.5530 7.0324
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.67114 0.12246 5.481 4.24e-08 ***
femWomen -0.22858 0.06522 -3.505 0.000457 ***
marMarried 0.09649 0.07283 1.325 0.185209
kid5 -0.14219 0.04845 -2.934 0.003341 **
phd -0.01273 0.03130 -0.407 0.684343
ment 0.01875 0.00228 8.222 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.23680 0.29552 0.801 0.4230
femWomen -0.25115 0.15911 -1.579 0.1144
marMarried 0.32623 0.18082 1.804 0.0712 .
kid5 -0.28525 0.11113 -2.567 0.0103 *
phd 0.02222 0.07956 0.279 0.7800
ment 0.08012 0.01302 6.155 7.52e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 12
Log-likelihood: -1605 on 12 Df
The zero-inflated Pisson (ZIP) assumes a mixture model \[\begin{eqnarray*}
\mathbb{P}(Y=0) &=& \phi + (1 - \phi) f(0), \\
\mathbb{P}(Y = j) &=& (1 - \phi) f(j), \quad j > 0,
\end{eqnarray*}\] where \(f\) is a Poisson model. The parameter \(\phi\) represents the probaility of always observing a zero count and is modeled as a Bernoulli outcome.
modz <-zeroinfl(art ~ ., data = bioChemists)summary(modz)
Call:
zeroinfl(formula = art ~ ., data = bioChemists)
Pearson residuals:
Min 1Q Median 3Q Max
-2.3253 -0.8652 -0.2826 0.5404 7.2976
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.640838 0.121307 5.283 1.27e-07 ***
femWomen -0.209145 0.063405 -3.299 0.000972 ***
marMarried 0.103751 0.071111 1.459 0.144565
kid5 -0.143320 0.047429 -3.022 0.002513 **
phd -0.006166 0.031008 -0.199 0.842378
ment 0.018098 0.002294 7.888 3.07e-15 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.577059 0.509386 -1.133 0.25728
femWomen 0.109746 0.280082 0.392 0.69518
marMarried -0.354014 0.317611 -1.115 0.26502
kid5 0.217097 0.196482 1.105 0.26919
phd 0.001274 0.145263 0.009 0.99300
ment -0.134114 0.045243 -2.964 0.00303 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 21
Log-likelihood: -1605 on 12 Df
The coefficients for the binary response part take opposite signs from those from the hurdle model, because hurdle model models the probability of observing a nonzero and ZIP models the probability of always observing zero.
The hurdle model and ZIP give very similar prediction
We can use likelihood ratio (LRT) test to compare nested models. Consider a simplied model, where the Poisson model uses predictors fem, kid5, and ment and the Bernoulli model uses predictor ment.
modz2 <-zeroinfl(art ~ fem + kid5 + ment | ment, data = bioChemists)summary(modz2)
Call:
zeroinfl(formula = art ~ fem + kid5 + ment | ment, data = bioChemists)
Pearson residuals:
Min 1Q Median 3Q Max
-2.2802 -0.8807 -0.2718 0.5131 7.4788
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.694517 0.053025 13.098 < 2e-16 ***
femWomen -0.233857 0.058400 -4.004 6.22e-05 ***
kid5 -0.126516 0.039668 -3.189 0.00143 **
ment 0.018004 0.002224 8.096 5.67e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.68488 0.20529 -3.336 0.000849 ***
ment -0.12680 0.03981 -3.185 0.001448 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 13
Log-likelihood: -1608 on 6 Df
The LRT yields a large p-value, indicating the smaller model is justifiable.