Count Regression (ELMR Chapter 5)

Biostat 200C

Author

Dr. Jin Zhou @ UCLA

Published

April 24, 2025

Display system information and load tidyverse and faraway packages

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.1    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.4.1       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.49        jsonlite_1.9.1    xfun_0.51        
[13] digest_0.6.37     rlang_1.1.5       evaluate_1.0.3   
library(tidyverse)
library(faraway)

faraway package contains the datasets in the ELMR book.

1 Galápagos data

The gala data set records the number of plant species and the number of endemic species for 30 Galápagos islands. We are interested in modeling the number of species using a regression.

gala <- as_tibble(gala, rownames = "Island") %>%
  print(n = Inf)
# A tibble: 30 × 8
   Island       Species Endemics    Area Elevation Nearest Scruz Adjacent
   <chr>          <dbl>    <dbl>   <dbl>     <dbl>   <dbl> <dbl>    <dbl>
 1 Baltra            58       23   25.1        346     0.6   0.6     1.84
 2 Bartolome         31       21    1.24       109     0.6  26.3   572.  
 3 Caldwell           3        3    0.21       114     2.8  58.7     0.78
 4 Champion          25        9    0.1         46     1.9  47.4     0.18
 5 Coamano            2        1    0.05        77     1.9   1.9   904.  
 6 Daphne.Major      18       11    0.34       119     8     8       1.84
 7 Daphne.Minor      24        0    0.08        93     6    12       0.34
 8 Darwin            10        7    2.33       168    34.1 290.      2.85
 9 Eden               8        4    0.03        71     0.4   0.4    18.0 
10 Enderby            2        2    0.18       112     2.6  50.2     0.1 
11 Espanola          97       26   58.3        198     1.1  88.3     0.57
12 Fernandina        93       35  634.        1494     4.3  95.3  4669.  
13 Gardner1          58       17    0.57        49     1.1  93.1    58.3 
14 Gardner2           5        4    0.78       227     4.6  62.2     0.21
15 Genovesa          40       19   17.4         76    47.4  92.2   129.  
16 Isabela          347       89 4669.        1707     0.7  28.1   634.  
17 Marchena          51       23  129.         343    29.1  85.9    59.6 
18 Onslow             2        2    0.01        25     3.3  45.9     0.1 
19 Pinta            104       37   59.6        777    29.1 120.    129.  
20 Pinzon           108       33   18.0        458    10.7  10.7     0.03
21 Las.Plazas        12        9    0.23        94     0.5   0.6    25.1 
22 Rabida            70       30    4.89       367     4.4  24.4   572.  
23 SanCristobal     280       65  552.         716    45.2  66.6     0.57
24 SanSalvador      237       81  572.         906     0.2  19.8     4.89
25 SantaCruz        444       95  904.         864     0.6   0       0.52
26 SantaFe           62       28   24.1        259    16.5  16.5     0.52
27 SantaMaria       285       73  171.         640     2.6  49.2     0.1 
28 Seymour           44       16    1.84       147     0.6   9.6    25.1 
29 Tortuga           16        8    1.24       186     6.8  50.9    18.0 
30 Wolf              21       12    2.85       253    34.1 255.      2.33

A histogram of the number of species

gala %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = Species)) + 
  labs(title = "Species on Galápagos Islands")

2 Poisson regession

  • Assume the count response \(Y_i\) follows a Poisson\((\mu_i)\) distribution with density function \[ \mathbb{P}(Y_i = y_i) = e^{-\mu_i} \frac{\mu_i^{y_i}}{y_i!}. \]

  • The Poisson parameter \(\mu_i\) is related to the predictors via an inverse link function \[ \mu_i = e^{\eta_i}, \] where \(\eta_i\) is the linear predictor or systematic component \[ \eta_i = \mathbf{x}_i^T \boldsymbol{\beta}. \]

  • The function \[ \eta_i = g(\mu_i) = \log \mu_i \] that links \(\mathbb{E} Y_i = \mu_i\) to the systematic component is called the link function. This particular link function is called the log link function. The Poisson regression model with log link is also called a log-linear model.

  • Given the \(n\) independent data points \((y_i, \mathbf{x}_i)\), \(i=1,\ldots,n\), the log-likelihood is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_i y_i \log \mu_i - \mu_i - \log y_i! \\ &=& \sum_i y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} - \log y_i! \end{eqnarray*}\]

  • We maximize the log-likelihood function to find the MLE of regression coefficient \(\boldsymbol{\beta}\).

modp <- glm(Species ~ . - Endemics - Island, family = poisson, data = gala)
summary(modp)

Call:
glm(formula = Species ~ . - Endemics - Island, family = poisson, 
    data = gala)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: 889.68

Number of Fisher Scoring iterations: 5

3 Interpretation

  • Regression coefficients reflect the unit change in the log of Poisson mean.
library(gtsummary)
modp %>%
  tbl_regression(intercept = TRUE)
Characteristic log(IRR) 95% CI p-value
(Intercept) 3.2 3.1, 3.3 <0.001
Area 0.00 0.00, 0.00 <0.001
Elevation 0.00 0.00, 0.00 <0.001
Nearest 0.01 0.01, 0.01 <0.001
Scruz -0.01 -0.01, 0.00 <0.001
Adjacent 0.00 0.00, 0.00 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
modp %>%
  tbl_regression(intercept = TRUE, exponentiate = TRUE)
Characteristic IRR 95% CI p-value
(Intercept) 23.4 21.2, 25.9 <0.001
Area 1.00 1.00, 1.00 <0.001
Elevation 1.00 1.00, 1.00 <0.001
Nearest 1.01 1.01, 1.01 <0.001
Scruz 0.99 0.99, 1.00 <0.001
Adjacent 1.00 1.00, 1.00 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

4 Goodness of fit

  • The deviance for the Poisson regression is \[\begin{eqnarray*} D &=& 2 \sum_i [y_i \log(y_i) - y_i] - 2 \sum_i [y_i \log (\widehat{\mu}_i) - \widehat{\mu}_i] \\ &=& 2 \sum_i [y_i \log(y_i / \widehat{\mu}_i) - (y_i - \widehat{\mu}_i)], \end{eqnarray*}\] where \(\widehat{\mu}_i\) are the fitted values from the model. The Poisson deviance is also called the G-statistic.

  • For the Galápagos example, comparing the deviance \(D\) to \(\chi_{n - p}^2\) gives an extremely small p-value (why?), indicating a lack of fit.

  • An alternative goodness of fit test compares the Pearson \(X^2\) \[ X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \sum_i \frac{(y_i - \widehat{\mu}_i)^2}{\widehat{\mu}_i} \] to the asymptotic distribution \(\chi_{n - p}^2\). Again it indicates serious lack of fit.

# predmu = predict(modp, type = "response")
# sum((gala$Species - predmu)^2 / predmu)
(px2 <- sum(residuals(modp, type = "pearson")^2))
[1] 761.9792

5 Diagnostics

  • Plot deviance residuals against linear predictor. We don’t see outliers.
gala %>%
  mutate(devres  = residuals(modp, type = "deviance"), 
         linpred = predict(modp, type = "link")) %>%
  ggplot + 
  geom_point(mapping = aes(x = linpred, y = devres)) + 
  labs(x = "Linear predictor", y = "Deviance residual")

  • Plot Pearson residuals against linear predictor. We don’t see outliers.
gala %>%
  mutate(perres  = residuals(modp, type = "pearson"),
         linpred = predict(modp, type = "link")) %>%
  ggplot + 
  geom_point(mapping = aes(x = linpred, y = perres)) + 
  labs(x = "Linear predictor", y = "Pearson residual")

  • Any high leverage points? Fernandina and Isabela islands have high elevation and relatively large area.
halfnorm(hatvalues(modp))

gala %>%
  slice(c(12, 16))
# A tibble: 2 × 8
  Island     Species Endemics  Area Elevation Nearest Scruz Adjacent
  <chr>        <dbl>    <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>
1 Fernandina      93       35  634.      1494     4.3  95.3    4669.
2 Isabela        347       89 4669.      1707     0.7  28.1     634.
  • Any high influential points?
halfnorm(cooks.distance((modp)))

  • For Poisson distribution, \(\mathbb{E}Y_i = \operatorname{Var}Y_i = \mu_i\). Let’s check this assumption. We use \((y_i - \widehat{\mu}_i)^2\) as a crude estimate of \(\operatorname{Var}Y_i\). Majority of points are above the 45 degree line. Looks like we have overdispersion.
gala %>%
  mutate(predmu = predict(modp, type = "response"), 
         res    = Species - predmu) %>%
  ggplot() +
  geom_point(mapping = aes(x = predmu, y = res^2)) +
  geom_abline(intercept = 0, slope = 1) +
  labs(x = expression(hat(mu)), y = expression((y - hat(mu))^2)) + 
  scale_x_log10() + 
  scale_y_log10()

6 Overdispersion

  • In overdispersion model, we assume an overdispersion parameter \(\phi\) such that \[ \operatorname{Var} Y_i = \phi \mu_i. \]

  • Given MLE, the overdispersion parameter is estimated by \[ \widehat{\phi} = \frac{X^2}{n - p}. \]

(dp <- px2 / modp$df.residual)
[1] 31.74914
  • The inference on regression coefficient is changed with overdispersion parameter. We see predictors Area, Elevation and Adjacent are significant, but not Nearest and Scruz.
summary(modp, dispersion = dp)

Call:
glm(formula = Species ~ . - Endemics - Island, family = poisson, 
    data = gala)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.1548079  0.2915897  10.819  < 2e-16 ***
Area        -0.0005799  0.0001480  -3.918 8.95e-05 ***
Elevation    0.0035406  0.0004925   7.189 6.53e-13 ***
Nearest      0.0088256  0.0102621   0.860    0.390    
Scruz       -0.0057094  0.0035251  -1.620    0.105    
Adjacent    -0.0006630  0.0001653  -4.012 6.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 31.74914)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: 889.68

Number of Fisher Scoring iterations: 5
drop1(modp, scale = dp, test = "F")
Warning in drop1.glm(modp, scale = dp, test = "F"): F test assumes
'quasipoisson' family
Single term deletions

Model:
Species ~ (Island + Endemics + Area + Elevation + Nearest + Scruz + 
    Adjacent) - Endemics - Island

scale:  31.74914 

          Df Deviance    AIC F value    Pr(>F)    
<none>         716.85 889.68                      
Area       1  1204.35 903.03 16.3217 0.0004762 ***
Elevation  1  2389.57 940.36 56.0028 1.007e-07 ***
Nearest    1   739.41 888.39  0.7555 0.3933572    
Scruz      1   813.62 890.72  3.2400 0.0844448 .  
Adjacent   1  1341.45 907.35 20.9119 0.0001230 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7 Quasi-Poisson

  • Quasi-Poisson assumes \[ \mathbb{E}Y_i = \mu_i, \quad \operatorname{Var}Y_i = \phi \mu_i \] only and yields similar inference.
modd <- glm(Species ~ . - Endemics - Island, family = quasipoisson, data = gala)
summary(modd)

Call:
glm(formula = Species ~ . - Endemics - Island, family = quasipoisson, 
    data = gala)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.1548079  0.2915901  10.819 1.03e-10 ***
Area        -0.0005799  0.0001480  -3.918 0.000649 ***
Elevation    0.0035406  0.0004925   7.189 1.98e-07 ***
Nearest      0.0088256  0.0102622   0.860 0.398292    
Scruz       -0.0057094  0.0035251  -1.620 0.118380    
Adjacent    -0.0006630  0.0001653  -4.012 0.000511 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 31.74921)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
drop1(modd, test = "F")
Single term deletions

Model:
Species ~ (Island + Endemics + Area + Elevation + Nearest + Scruz + 
    Adjacent) - Endemics - Island
          Df Deviance F value    Pr(>F)    
<none>         716.85                      
Area       1  1204.35 16.3217 0.0004762 ***
Elevation  1  2389.57 56.0028 1.007e-07 ***
Nearest    1   739.41  0.7555 0.3933572    
Scruz      1   813.62  3.2400 0.0844448 .  
Adjacent   1  1341.45 20.9119 0.0001230 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8 Negative bionomial regression

  • An alternative to Poisson model is the negative binomial distribution with probability mass function \[ \mathbb{P}(Y = y) = \binom{y + r - 1}{r - 1} (1 - p)^y p^r, \quad y = 0, 1, \ldots \] HW3: show \(\mathbb{E}Y = \mu = rp / (1 - p)\) and \(\operatorname{Var} Y = r p / (1 - p)^2\).

  • We can relate mean response \(\mu\) to the predictors via log link function \[ \eta = \mathbf{x}^T \boldsymbol{\beta} = \log \mu. \]

  • Pre-specify r=1 (\(\theta\) in MASS) and fit negative bionomial regression:

library(MASS)
modn <- glm(Species ~ . - Endemics - Island, family = negative.binomial(1), data = gala)
summary(modn)

Call:
glm(formula = Species ~ . - Endemics - Island, family = negative.binomial(1), 
    data = gala)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.9040718  0.2726617  10.651 1.41e-10 ***
Area        -0.0006338  0.0003143  -2.017  0.05506 .  
Elevation    0.0038575  0.0007561   5.102 3.21e-05 ***
Nearest      0.0027834  0.0149036   0.187  0.85342    
Scruz       -0.0018568  0.0030569  -0.607  0.54929    
Adjacent    -0.0007617  0.0002492  -3.057  0.00542 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.7232273)

    Null deviance: 54.069  on 29  degrees of freedom
Residual deviance: 20.741  on 24  degrees of freedom
AIC: 305.75

Number of Fisher Scoring iterations: 16
  • Estimate parameter r (\(\theta\) in MASS) from data:
glm.nb(Species ~ . - Endemics - Island, data = gala) %>%
  summary()

Call:
glm.nb(formula = Species ~ . - Endemics - Island, data = gala, 
    init.theta = 1.674602286, link = log)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.9065247  0.2510344  11.578  < 2e-16 ***
Area        -0.0006336  0.0002865  -2.211 0.027009 *  
Elevation    0.0038551  0.0006916   5.574 2.49e-08 ***
Nearest      0.0028264  0.0136618   0.207 0.836100    
Scruz       -0.0018976  0.0028096  -0.675 0.499426    
Adjacent    -0.0007605  0.0002278  -3.338 0.000842 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.6746) family taken to be 1)

    Null deviance: 88.431  on 29  degrees of freedom
Residual deviance: 33.196  on 24  degrees of freedom
AIC: 304.22

Number of Fisher Scoring iterations: 1

              Theta:  1.675 
          Std. Err.:  0.442 

 2 x log-likelihood:  -290.223 

9 Zero-inflated count models

  • The bioChemists data set contains 915 biochemistry graduate students. The outcome of interest is the number of articles procuded during the last three years of the PhD.
library(pscl)

(bioChemists <- as_tibble(bioChemists))
# A tibble: 915 × 6
     art fem   mar      kid5   phd  ment
   <int> <fct> <fct>   <dbl> <dbl> <int>
 1     0 Men   Married     0  2.52     7
 2     0 Women Single      0  2.05     6
 3     0 Women Single      0  3.75     6
 4     0 Men   Married     1  1.18     3
 5     0 Women Single      0  3.75    26
 6     0 Women Married     2  3.59     2
 7     0 Women Single      0  3.19     3
 8     0 Men   Married     2  2.96     4
 9     0 Men   Single      0  4.62     6
10     0 Women Married     0  1.25     0
# ℹ 905 more rows
  • Descriptive statistics.
bioChemists %>%
  ggplot() + 
  geom_bar(mapping = aes(x = art, fill = fem)) + 
  labs(x = "Articles", y = "Count", title = "Publications by PhD Students")

bioChemists %>%
  ggplot() + 
  geom_bar(mapping = aes(x = art, fill = mar))

  • We first fit a Poisson regresison. A deviance of 1634.4 on 909 degrees of freedom indicates a lack of fit.
modp <- glm(art ~ ., family = poisson, data = bioChemists)
summary(modp)

Call:
glm(formula = art ~ ., family = poisson, data = bioChemists)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.304617   0.102981   2.958   0.0031 ** 
femWomen    -0.224594   0.054613  -4.112 3.92e-05 ***
marMarried   0.155243   0.061374   2.529   0.0114 *  
kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
phd          0.012823   0.026397   0.486   0.6271    
ment         0.025543   0.002006  12.733  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1817.4  on 914  degrees of freedom
Residual deviance: 1634.4  on 909  degrees of freedom
AIC: 3314.1

Number of Fisher Scoring iterations: 5
  • The observed against fitte values plot shows the prediction for 0 is quite off.
tibble(ocount = table(bioChemists$art)[1:8],    # observed count for 0:7
       pcount = colSums(predprob(modp)[, 1:8]), # expected count for 0:7
       count  = 0:7) %>%
  ggplot(mapping = aes(x = pcount, y = ocount, label = count)) + 
  geom_point() + 
  geom_text(nudge_y = 8) +
  labs(x = "Predicted", y = "Observed")

  • The hurdle model assumes \[\begin{eqnarray*} \mathbb{P}(Y = 0) &=& f_1(0), \\ \mathbb{P}(Y = j) &=& \frac{1 - f_1(0)}{1 - f_2(0)} f_2(j), \quad j > 0. \end{eqnarray*}\] We shall use a Bernoulli response model for \(f_1(1)\) to model the probability of getting over the hurdle 0, and use the zero-truncated Poisson model for \(f_2\) to model the probability of observing an outcome greater than 0.
library(gtsummary)
modh <- hurdle(art ~ ., data = bioChemists)
summary(modh)

Call:
hurdle(formula = art ~ ., data = bioChemists)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.4105 -0.8913 -0.2817  0.5530  7.0324 

Count model coefficients (truncated poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.67114    0.12246   5.481 4.24e-08 ***
femWomen    -0.22858    0.06522  -3.505 0.000457 ***
marMarried   0.09649    0.07283   1.325 0.185209    
kid5        -0.14219    0.04845  -2.934 0.003341 ** 
phd         -0.01273    0.03130  -0.407 0.684343    
ment         0.01875    0.00228   8.222  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.23680    0.29552   0.801   0.4230    
femWomen    -0.25115    0.15911  -1.579   0.1144    
marMarried   0.32623    0.18082   1.804   0.0712 .  
kid5        -0.28525    0.11113  -2.567   0.0103 *  
phd          0.02222    0.07956   0.279   0.7800    
ment         0.08012    0.01302   6.155 7.52e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 12 
Log-likelihood: -1605 on 12 Df
  • The zero-inflated Pisson (ZIP) assumes a mixture model \[\begin{eqnarray*} \mathbb{P}(Y=0) &=& \phi + (1 - \phi) f(0), \\ \mathbb{P}(Y = j) &=& (1 - \phi) f(j), \quad j > 0, \end{eqnarray*}\] where \(f\) is a Poisson model. The parameter \(\phi\) represents the probaility of always observing a zero count and is modeled as a Bernoulli outcome.
modz <- zeroinfl(art ~ ., data = bioChemists)
summary(modz)

Call:
zeroinfl(formula = art ~ ., data = bioChemists)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.3253 -0.8652 -0.2826  0.5404  7.2976 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.640838   0.121307   5.283 1.27e-07 ***
femWomen    -0.209145   0.063405  -3.299 0.000972 ***
marMarried   0.103751   0.071111   1.459 0.144565    
kid5        -0.143320   0.047429  -3.022 0.002513 ** 
phd         -0.006166   0.031008  -0.199 0.842378    
ment         0.018098   0.002294   7.888 3.07e-15 ***

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.577059   0.509386  -1.133  0.25728   
femWomen     0.109746   0.280082   0.392  0.69518   
marMarried  -0.354014   0.317611  -1.115  0.26502   
kid5         0.217097   0.196482   1.105  0.26919   
phd          0.001274   0.145263   0.009  0.99300   
ment        -0.134114   0.045243  -2.964  0.00303 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 21 
Log-likelihood: -1605 on 12 Df

The coefficients for the binary response part take opposite signs from those from the hurdle model, because hurdle model models the probability of observing a nonzero and ZIP models the probability of always observing zero.

  • The hurdle model and ZIP give very similar prediction
tibble(fitted_h = fitted(modh),
       fitted_z = fitted(modz)) %>%
  ggplot() + 
  geom_point(mapping = aes(x = fitted_h, y = fitted_z)) + 
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "Hurdle predictions", y = "ZIP predictions")

  • We can use likelihood ratio (LRT) test to compare nested models. Consider a simplied model, where the Poisson model uses predictors fem, kid5, and ment and the Bernoulli model uses predictor ment.
modz2 <- zeroinfl(art ~ fem + kid5 + ment | ment, data = bioChemists)
summary(modz2)

Call:
zeroinfl(formula = art ~ fem + kid5 + ment | ment, data = bioChemists)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.2802 -0.8807 -0.2718  0.5131  7.4788 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.694517   0.053025  13.098  < 2e-16 ***
femWomen    -0.233857   0.058400  -4.004 6.22e-05 ***
kid5        -0.126516   0.039668  -3.189  0.00143 ** 
ment         0.018004   0.002224   8.096 5.67e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.68488    0.20529  -3.336 0.000849 ***
ment        -0.12680    0.03981  -3.185 0.001448 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 13 
Log-likelihood: -1608 on 6 Df

The LRT yields a large p-value, indicating the smaller model is justifiable.

lrt <- 2 * (modz$loglik - modz2$loglik)
pchisq(lrt, 6, lower.tail = FALSE)
[1] 0.4041153
  • Interpretation
exp(coef(modz2))
count_(Intercept)    count_femWomen        count_kid5        count_ment 
        2.0027412         0.7914748         0.8811604         1.0181669 
 zero_(Intercept)         zero_ment 
        0.5041522         0.8809082 
  • Prediction. For a single male with no children whose mentor procuded six articles, we predict that the chance he produces no articles is 0.278.
tibble(fem = "Men", mar = "Single", kid5 = 0, ment = 6) %>%
  predict(modz2, newdata = ., type = "prob")
          0         1       2         3          4          5          6
1 0.2775879 0.1939403 0.21636 0.1609142 0.08975799 0.04005363 0.01489462
            7           8            9           10           11           12
1 0.004747556 0.001324094 0.0003282578 7.324092e-05 1.485593e-05 2.762214e-06
            13           14           15           16          17           18
1 4.740812e-07 7.555505e-08 1.123857e-08 1.567219e-09 2.05693e-10 2.549681e-11
            19
1 2.994132e-12

The zero part of model contributes 0.191 and the Poisson part contributes 0.087.

tibble(fem = "Men", mar = "Single", kid5 = 0, ment = 6) %>%
  predict(modz2, newdata = ., type = "zero")
       1 
0.190666