Contingency Tables (ELMR Chapter 6)

Biostat 200C

Author

Dr. Jin Zhou @ UCLA

Published

May 4, 2025

Display system information and load tidyverse and faraway packages

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.1    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.4.1       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.49        jsonlite_1.9.1    xfun_0.51        
[13] digest_0.6.37     rlang_1.1.5       evaluate_1.0.3   
library(tidyverse)
library(faraway)

The faraway package contains the datasets used in the ELMR book.

1 Two-by-two tables

Image source: https://www.gep.com/mind/blog/outlook-for-the-global-semiconductor-silicon-wafer-industry

  • Semiconductor wafer quality data. The research question is whether the presence of particles on the die affects the quality of the wafer being produced.
(wafer <- tibble(
  y        = c(320, 14, 80, 36),
  particle = gl(2, 1, length = 4, labels = c("no", "yes")),
  quality  = gl(2, 2, labels = c("good", "bad"))))
# A tibble: 4 × 3
      y particle quality
  <dbl> <fct>    <fct>  
1   320 no       good   
2    14 yes      good   
3    80 no       bad    
4    36 yes      bad    

This kind of data is conveniently presented as a \(2 \times 2\) contingency table.

(ytable <- xtabs(y ~ particle + quality, data = wafer))
        quality
particle good bad
     no   320  80
     yes   14  36
  • The usual Pearson \(X^2\) test compares the test statistic \[ X^2 = \sum_{i, j} \frac{(y_{ij} - \widehat \mu_{ij})^2}{\widehat \mu_{ij}}, \] where \[ \widehat \mu_{ij} = n \widehat p_i \widehat p_j = n \frac{\sum_j y_{ij}}{n} \frac{\sum_i y_{ij}}{n}, \] to \(\chi_1^2\), since a \(2 \times 2\) table has \((2 - 1)(2 - 1) = 1\) degree of freedom.
summary(ytable)
Call: xtabs(formula = y ~ particle + quality, data = wafer)
Number of cases in table: 450 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 62.81, df = 1, p-value = 2.274e-15
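As a sanity check, the statistic can be assembled by hand from the marginal totals (a minimal sketch; muhat_2x2 is a new name, not in the original):

# expected counts under independence: mu_ij = (row total) x (column total) / n
muhat_2x2 <- outer(rowSums(ytable), colSums(ytable)) / sum(ytable)
# Pearson X^2; should reproduce Chisq = 62.81 above
sum((ytable - muhat_2x2)^2 / muhat_2x2)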
  • We will analyze these data using 4 different models corresponding to 4 different sampling schemes. They all reach the same conclusion.

    1. We observed the manufacturing process for a certain period of time and observed 450 wafers. The data were then cross-classified. We could use a Poisson model.
    2. We decided to sample 450 wafers. The data were then cross-classified. We could use a multinomial model.
    3. We selected 400 wafers without particles and 50 wafers with particles and then recorded the good or bad outcome. We could use a binomial model.
    4. We selected 400 wafers without particles and 50 wafers with particles that also included, by design, 334 good wafers and 116 bad ones. We could use a hypergeometric model.

1.1 Poisson model

  • We observe the manufacturing process for a certain period of time and observe 450 wafers. The data are then cross-classified. We model the observed counts by Poisson.
library(gtsummary)
glm(y ~ particle + quality, family = poisson, data = wafer) %>%
  #tbl_regression()
  summary()

Call:
glm(formula = y ~ particle + quality, family = poisson, data = wafer)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   5.6934     0.0572  99.535   <2e-16 ***
particleyes  -2.0794     0.1500 -13.863   <2e-16 ***
qualitybad   -1.0575     0.1078  -9.813   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 474.10  on 3  degrees of freedom
Residual deviance:  54.03  on 1  degrees of freedom
AIC: 83.774

Number of Fisher Scoring iterations: 5
  • The null (intercept-only) model assumes that the rate is constant across particle and quality levels. The analysis of deviance compares the drop in deviance \(474.10 - 54.03 = 420.07\) to a \(\chi_2^2\) distribution, indicating the null model is rejected in favor of our two-predictor Poisson model.

  • If we add an interaction term particle * quality, the model uses 4 parameters, the same number as the observations, so it is equivalent to the full/saturated model. The analysis of deviance compares the residual deviance 54.03 to a \(\chi_1^2\) distribution, indicating the interaction term is highly significant. (Both p-values are computed in the sketch after this list.)

  • Therefore our conclusion is that the presence of particles on the die significantly affects the quality of the wafer.
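A minimal sketch of both deviance tests, computed from the deviances reported above:

# null vs. two-predictor model: drop in deviance on 2 df
pchisq(474.10 - 54.03, df = 2, lower.tail = FALSE)
# two-predictor vs. saturated model: residual deviance on 1 df
pchisq(54.03, df = 1, lower.tail = FALSE)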

1.2 Multinomial model

  • If we assume the total sample size is fixed at \(n=450\), then we can model the counts by a multinomial model. We consider two models:

    • \(H_1\): one full/saturated model where each cell of the \(2 \times 2\) has its own probability \(p_{ij}\), and

    • \(H_0\): one reduced model that assumes particle is independent of quality, thus \(p_{ij} = p_i p_j\).

  • For the full/saturated model \(H_1\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j p_{ij}^{y_{ij}}. \] We estimate \(p_{ij}\) by the cell proportion (MLE) \[ \widehat{p}_{ij} = \frac{y_{ij}}{n}. \]

  • For the reduced model assuming independence \(H_0\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j (p_i p_j)^{y_{ij}} = \frac{n!}{\prod_i \prod_j y_{ij}!} p_i^{\sum_j y_{ij}} p_j^{\sum_i y_{ij}}. \] The MLE is \[ \widehat{p}_i = \frac{\sum_j y_{ij}}{n}, \quad \widehat{p}_j = \frac{\sum_i y_{ij}}{n}. \]

  • The analysis of deviance compares the deviance \[\begin{eqnarray*} D &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n} - 2 \sum_i \left(\sum_j y_{ij}\right) \log \widehat{p}_i - 2 \sum_j \left(\sum_i y_{ij}\right) \log \widehat{p}_j \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n \widehat{p}_i \widehat{p}_j} \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{\widehat{\mu}_{ij}} \end{eqnarray*}\] to a \(\chi_1^2\) distribution.

(partp <- xtabs(y ~ particle, data = wafer) %>% prop.table())
particle
       no       yes 
0.8888889 0.1111111 
(qualp <- xtabs(y ~ quality, data = wafer)  %>% prop.table())
quality
     good       bad 
0.7422222 0.2577778 
muhat <- 450 * outer(partp, qualp)
ytable <- xtabs(y ~ particle + quality, data = wafer)
2 * sum(ytable * log(ytable / muhat))
[1] 54.03045
  • We get the exact same result as the analysis of deviance in the Poisson model.

  • This connection between Poisson and multinomial is no surprise due to the following fact. If \(Y_1, \ldots, Y_k\) are independent Poisson with means \(\lambda_1, \ldots, \lambda_k\), then the joint distribution of \(Y_1, \ldots, Y_k \mid \sum_i Y_i = n\) is multinomial with probabilities \(p_j = \lambda_j / \sum_i \lambda_i\).
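A quick numerical check of this fact (a minimal sketch; the values of lambda and yvec are arbitrary):

lambda <- c(2, 5)
yvec   <- c(3, 4)
n      <- sum(yvec)
# conditional probability under independent Poissons, P(Y = y | sum(Y) = n)
prod(dpois(yvec, lambda)) / dpois(n, sum(lambda))
# multinomial probability with p_j = lambda_j / sum(lambda); identical
dmultinom(yvec, size = n, prob = lambda / sum(lambda))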

1.3 Binomial

  • If we view particle as a predictor affecting whether a wafer is good quality or bad quality, we end up with an independent binomial model.

  • The null (intercept-only) model corresponds to the hypothesis that particle does not affect quality.

tibble(good     = c(320, 14),
       bad      = c(80, 36),
       particle = c("no", "yes")) %>%
  print() %>%
  glm(cbind(good, bad) ~ 1, family = binomial, data = .) %>%
  summary()
# A tibble: 2 × 3
   good   bad particle
  <dbl> <dbl> <chr>   
1   320    80 no      
2    14    36 yes     

Call:
glm(formula = cbind(good, bad) ~ 1, family = binomial, data = .)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0576     0.1078   9.813   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 54.03  on 1  degrees of freedom
Residual deviance: 54.03  on 1  degrees of freedom
AIC: 66.191

Number of Fisher Scoring iterations: 3

The alternative model corresponds to the full/saturated model where particle is included as a predictor.
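A sketch of that saturated fit, re-creating the two-row tibble above under the assumed name wafer_bin:

wafer_bin <- tibble(good     = c(320, 14),
                    bad      = c(80, 36),
                    particle = c("no", "yes"))
glm(cbind(good, bad) ~ particle, family = binomial, data = wafer_bin) %>%
  summary()
# the residual deviance drops to (essentially) zero, as expected for a
# saturated model; the null deviance 54.03 matches the output above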

  • Again we observe exactly the same analysis of deviance inference.

  • When there are more than two rows or columns, this model is called the product binomial/multinomial model.

1.4 Hypergeometric

  • Finally if we fix both row and column marginal totals, the probability of the observed table is \[\begin{eqnarray*} p(y_{11}, y_{12}, y_{21}, y_{22}) &=& \frac{\binom{y_{1\cdot}}{y_{11}} \binom{y_{2\cdot}}{y_{22}} \binom{y_{\cdot 1}}{y_{21}} \binom{y_{\cdot 2}}{y_{12}}}{\binom{n}{y_{11} \, y_{12} \, y_{21} \, y_{22}}} \\ &=& \frac{y_{1\cdot}! y_{2\cdot}! y_{\cdot 1}! y_{\cdot 2}!}{y_{11}! y_{12}! y_{21}! y_{22}! n!} \\ &=& \frac{\binom{y_{1 \cdot}}{y_{11}} \binom{y_{2 \cdot}}{y_{\cdot 1} - y_{11}}}{\binom{n}{y_{\cdot 1}}}. \end{eqnarray*}\]

  • Under the null hypothesis that particle is not associated with quality, Fisher’s exact test calculates the p-value by summing the probabilities of tables at least as extreme as the one observed; for the one-sided alternative this is \[ \sum_{y_{11} \ge 320} p(y_{11}, y_{12}, y_{21}, y_{22}). \]
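This one-sided sum can be evaluated directly with the hypergeometric pmf (a minimal sketch; note that fisher.test below reports a two-sided p-value):

# y11 = good wafers among those without particles, ranging from the observed
# 320 up to min(400, 334) = 334; fixed margins: m = 400 no-particle wafers,
# n = 50 particle wafers, k = 334 good wafers drawn
sum(dhyper(320:334, m = 400, n = 50, k = 334))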

fisher.test(ytable)

    Fisher's Exact Test for Count Data

data:  ytable
p-value = 2.955e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  5.090628 21.544071
sample estimates:
odds ratio 
  10.21331 

The sample odds ratio is \[ \frac{\pi_{\text{no particle}} / (1 - \pi_{\text{no particle}})}{\pi_{\text{particle}} / (1 - \pi_{\text{particle}})} = \frac{y_{11} y_{22}}{y_{12} y_{21}}. \] Note that fisher.test reports the conditional maximum likelihood estimate of the odds ratio, which differs slightly from the sample odds ratio computed below.

(320 * 36) / (14 * 80)
[1] 10.28571

2 Larger two-way tables

  • The haireye data set contains data on 592 statistics students cross-classified by hair and eye color.
haireye <- as_tibble(haireye) %>%
  print(n = Inf)
# A tibble: 16 × 3
       y eye   hair 
   <dbl> <fct> <fct>
 1     5 green BLACK
 2    29 green BROWN
 3    14 green RED  
 4    16 green BLOND
 5    15 hazel BLACK
 6    54 hazel BROWN
 7    14 hazel RED  
 8    10 hazel BLOND
 9    20 blue  BLACK
10    84 blue  BROWN
11    17 blue  RED  
12    94 blue  BLOND
13    68 brown BLACK
14   119 brown BROWN
15    26 brown RED  
16     7 brown BLOND
(haireye_table <- xtabs(y ~ hair + eye, data = haireye))
       eye
hair    green hazel blue brown
  BLACK     5    15   20    68
  BROWN    29    54   84   119
  RED      14    14   17    26
  BLOND    16    10   94     7
  • Graphical summary of the contingency table by a mosaic plot. If eye and hair were independent, we would expect the tiles to line up in a grid.
mosaicplot(haireye_table, color = TRUE, main = NULL, las = 1)

  • Pearson \(X^2\) test for independence yields
summary(haireye_table)
Call: xtabs(formula = y ~ hair + eye, data = haireye)
Number of cases in table: 592 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 138.29, df = 9, p-value = 2.325e-25
  • We fit a Poisson GLM:
    modc <- glm(y ~ hair + eye, family = poisson, data = haireye)
    summary(modc)

Call:
glm(formula = y ~ hair + eye, family = poisson, data = haireye)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.4575     0.1523  16.136  < 2e-16 ***
hairBROWN     0.9739     0.1129   8.623  < 2e-16 ***
hairRED      -0.4195     0.1528  -2.745  0.00604 ** 
hairBLOND     0.1621     0.1309   1.238  0.21569    
eyehazel      0.3737     0.1624   2.301  0.02139 *  
eyeblue       1.2118     0.1424   8.510  < 2e-16 ***
eyebrown      1.2347     0.1420   8.694  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 453.31  on 15  degrees of freedom
Residual deviance: 146.44  on  9  degrees of freedom
AIC: 241.04

Number of Fisher Scoring iterations: 5

which clearly shows a lack of fit. The interaction model is equivalent to the full/saturated model. Therefore we see strong evidence of dependence between eye and hair color.
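The lack of fit can be quantified by referring the residual deviance to its asymptotic chi-square reference (a minimal sketch):

# residual deviance 146.44 on 9 df; a tiny p-value indicates lack of fit
pchisq(deviance(modc), df.residual(modc), lower.tail = FALSE)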

3 Three-way contingency tables

  • The femsmoke data records female smokers and non-smokers, their age group, and whether the subjects were dead or still alive after 20 years.
femsmoke <- as_tibble(femsmoke) %>%
  print(n = Inf)
# A tibble: 28 × 4
       y smoker dead  age  
   <dbl> <fct>  <fct> <fct>
 1     2 yes    yes   18-24
 2     1 no     yes   18-24
 3     3 yes    yes   25-34
 4     5 no     yes   25-34
 5    14 yes    yes   35-44
 6     7 no     yes   35-44
 7    27 yes    yes   45-54
 8    12 no     yes   45-54
 9    51 yes    yes   55-64
10    40 no     yes   55-64
11    29 yes    yes   65-74
12   101 no     yes   65-74
13    13 yes    yes   75+  
14    64 no     yes   75+  
15    53 yes    no    18-24
16    61 no     no    18-24
17   121 yes    no    25-34
18   152 no     no    25-34
19    95 yes    no    35-44
20   114 no     no    35-44
21   103 yes    no    45-54
22    66 no     no    45-54
23    64 yes    no    55-64
24    81 no     no    55-64
25     7 yes    no    65-74
26    28 no     no    65-74
27     0 yes    no    75+  
28     0 no     no    75+  

There are three factors: smoker, dead, and age. If we cross-classify over smoker and dead only,

(ct <- xtabs(y ~ smoker + dead, data = femsmoke))
      dead
smoker yes  no
   yes 139 443
   no  230 502
prop.table(ct, 1)
      dead
smoker       yes        no
   yes 0.2388316 0.7611684
   no  0.3142077 0.6857923

we see that a significantly higher percentage of non-smokers died than smokers.

summary(ct)
Call: xtabs(formula = y ~ smoker + dead, data = femsmoke)
Number of cases in table: 1314 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 9.121, df = 1, p-value = 0.002527
  • However, smoking status is confounded with age group: smokers are concentrated in the younger age groups, and younger people are more likely to live for another 20 years.
prop.table(xtabs(y ~ smoker + age, data = femsmoke), 2)
      age
smoker     18-24     25-34     35-44     45-54     55-64     65-74       75+
   yes 0.4700855 0.4412811 0.4739130 0.6250000 0.4872881 0.2181818 0.1688312
   no  0.5299145 0.5587189 0.5260870 0.3750000 0.5127119 0.7818182 0.8311688
  • The odds ratios in all age groups are (the 75+ group yields NaN because no one in that group was still alive):
ct3 <- xtabs(y ~ smoker + dead + age, data = femsmoke)
apply(ct3, 3, function (x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
   18-24    25-34    35-44    45-54    55-64    65-74      75+ 
2.301887 0.753719 2.400000 1.441748 1.613672 1.148515      NaN 
  • The Cochran-Mantel-Haenszel (CMH) test assesses independence in \(2 \times 2\) tables across \(K\) strata. It compares the statistic \[ \frac{(|\sum_k y_{11k} - \sum_k \mathbb{E}y_{11k}| - 1/2)^2}{\sum_k \operatorname{Var} y_{11k}}, \] where the expectation and variance are computed under the null hypothesis of independence in each stratum, to the asymptotic null distribution \(\chi_1^2\). With exact = TRUE, mantelhaen.test instead performs an exact conditional test.
mantelhaen.test(ct3, exact = TRUE)

    Exact conditional test of independence in 2 x 2 x k tables

data:  ct3
S = 139, p-value = 0.01591
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 1.068889 2.203415
sample estimates:
common odds ratio 
         1.530256 
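As a check on the formula above, the continuity-corrected statistic can be assembled by hand (a minimal sketch; it should match mantelhaen.test(ct3, exact = FALSE)):

# per-stratum observed y_11k, null expectation, and null variance
parts <- apply(ct3, 3, function(x) {
  n <- sum(x)
  c(y = x[1, 1],
    e = sum(x[1, ]) * sum(x[, 1]) / n,
    v = prod(rowSums(x)) * prod(colSums(x)) / (n^2 * (n - 1)))
})
cmh <- (abs(sum(parts["y", ]) - sum(parts["e", ])) - 0.5)^2 / sum(parts["v", ])
c(statistic = cmh, p.value = pchisq(cmh, df = 1, lower.tail = FALSE))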

3.1 Mutual independence model

  • Under mutual independence, \[ p_{ijk} = p_i p_j p_k \] so \[ \mathbb{E} Y_{ijk} = n p_{ijk} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k. \] So the main effect-only model corresponds to the mutual independence model.

  • We can test independence by the Pearson \(X^2\) test

summary(ct3)
Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
Number of cases in table: 1314 
Number of factors: 3 
Test for independence of all factors:
    Chisq = 790.6, df = 19, p-value = 2.14e-155

or by a Poisson GLM

modi <- glm(y ~ smoker + dead + age, family = poisson, data = femsmoke)
summary(modi)

Call:
glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.67778    0.10702  25.021  < 2e-16 ***
smokerno     0.22931    0.05554   4.129 3.64e-05 ***
deadno       0.94039    0.06139  15.319  < 2e-16 ***
age25-34     0.87618    0.11003   7.963 1.67e-15 ***
age35-44     0.67591    0.11356   5.952 2.65e-09 ***
age45-54     0.57536    0.11556   4.979 6.40e-07 ***
age55-64     0.70166    0.11307   6.206 5.45e-10 ***
age65-74     0.34377    0.12086   2.844  0.00445 ** 
age75+      -0.41837    0.14674  -2.851  0.00436 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9  on 27  degrees of freedom
Residual deviance:  735.0  on 19  degrees of freedom
AIC: 887.2

Number of Fisher Scoring iterations: 6

Both suggest a lack of fit.

modi %>% tbl_regression(exponentiate = TRUE, intercept = TRUE)
Characteristic   IRR    95% CI        p-value
(Intercept)      14.6   11.7, 17.9    <0.001
smoker
    yes
    no           1.26   1.13, 1.40    <0.001
dead
    yes
    no           2.56   2.27, 2.89    <0.001
age
    18-24
    25-34        2.40   1.94, 2.99    <0.001
    35-44        1.97   1.58, 2.46    <0.001
    45-54        1.78   1.42, 2.24    <0.001
    55-64        2.02   1.62, 2.53    <0.001
    65-74        1.41   1.11, 1.79    0.004
    75+          0.66   0.49, 0.88    0.004
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

The exponentiated coefficients recover the marginal proportions; for example, for smoker:

c(1, 1.26) / (1 + 1.26)
[1] 0.4424779 0.5575221
prop.table(xtabs(y ~ smoker, femsmoke))
smoker
      yes        no 
0.4429224 0.5570776 

3.2 Joint independence

  • If we assume the first two variables are dependent but jointly independent of the third, then \[ p_{ijk} = p_{ij} p_k \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ij} + \log p_k. \]
  • It leads to a log-linear model with just one interaction
    glm(y ~ smoker * dead + age, family = poisson, data = femsmoke) %>%
    summary()

Call:
glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      2.51582    0.12239  20.555  < 2e-16 ***
smokerno         0.50361    0.10743   4.688 2.76e-06 ***
deadno           1.15910    0.09722  11.922  < 2e-16 ***
age25-34         0.87618    0.11003   7.963 1.67e-15 ***
age35-44         0.67591    0.11356   5.952 2.65e-09 ***
age45-54         0.57536    0.11556   4.979 6.40e-07 ***
age55-64         0.70166    0.11307   6.206 5.45e-10 ***
age65-74         0.34377    0.12086   2.844  0.00445 ** 
age75+          -0.41837    0.14674  -2.851  0.00436 ** 
smokerno:deadno -0.37858    0.12566  -3.013  0.00259 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9  on 27  degrees of freedom
Residual deviance:  725.8  on 18  degrees of freedom
AIC: 880

Number of Fisher Scoring iterations: 6
It improves the fit, but the fit is still poor.
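The improvement over mutual independence can be tested with an analysis of deviance (a minimal sketch; modj is an assumed name for the joint independence fit above):

modj <- glm(y ~ smoker * dead + age, family = poisson, data = femsmoke)
# drop in deviance 735.0 - 725.8 = 9.2 on 1 df
anova(modi, modj, test = "Chi")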

3.3 Conditional independence

  • Let \(p_{ij \mid k}\) be the probability that an observation falls in the \((i, j)\)-cell given that the third variable takes value \(k\). If we assume the first two variables are independent given the value of the third, then \[ p_{ijk} = p_{ij\mid k} p_k = p_{i\mid k} p_{j \mid k} p_k = \frac{p_{ik} p_{jk}}{p_k} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ik} + \log p_{jk} - \log p_k. \]

  • It leads to a model with two interaction terms

glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
  summary()

Call:
glm(formula = y ~ smoker * age + dead * age, family = poisson, 
    data = femsmoke)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.34377    0.58563   0.587 0.557199    
smokerno             0.11980    0.18523   0.647 0.517785    
age25-34             0.91760    0.68737   1.335 0.181895    
age35-44             1.95402    0.62882   3.107 0.001887 ** 
age45-54             2.84979    0.60950   4.676 2.93e-06 ***
age55-64             3.44819    0.59868   5.760 8.43e-09 ***
age65-74             3.00134    0.61023   4.918 8.73e-07 ***
age75+               2.22118    0.64799   3.428 0.000609 ***
deadno               3.63759    0.58490   6.219 5.00e-10 ***
smokerno:age25-34    0.11616    0.22078   0.526 0.598789    
smokerno:age35-44   -0.01536    0.22749  -0.068 0.946172    
smokerno:age45-54   -0.63063    0.23414  -2.693 0.007074 ** 
smokerno:age55-64   -0.06894    0.22643  -0.304 0.760765    
smokerno:age65-74    1.15649    0.26427   4.376 1.21e-05 ***
smokerno:age75+      1.47413    0.35617   4.139 3.49e-05 ***
age25-34:deadno     -0.10756    0.68613  -0.157 0.875435    
age35-44:deadno     -1.33977    0.62810  -2.133 0.032920 *  
age45-54:deadno     -2.17125    0.61128  -3.552 0.000382 ***
age55-64:deadno     -3.17171    0.59999  -5.286 1.25e-07 ***
age65-74:deadno     -4.94977    0.61512  -8.047 8.49e-16 ***
age75+:deadno      -26.30450 5776.51889  -0.005 0.996367    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9378  on 27  degrees of freedom
Residual deviance:    8.3269  on  7  degrees of freedom
AIC: 184.52

Number of Fisher Scoring iterations: 17

This model fits the data well (residual deviance 8.33 on 7 degrees of freedom).

  • Significance of predictors
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
drop1(test = "Chi")
Single term deletions

Model:
y ~ smoker * age + dead * age
           Df Deviance    AIC    LRT  Pr(>Chi)    
<none>            8.33 184.52                     
smoker:age  6   101.83 266.03  93.51 < 2.2e-16 ***
age:dead    6   641.50 805.69 633.17 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4 Uniform association

  • If we consider a model with all two-way interactions \[ \log \mathbb{E}Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k + \log p_{ij} + \log p_{ik} + \log p_{jk}. \]
  • There is no simple interpretation in terms of independence.
modu <- glm(y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)
summary(modu)

Call:
glm(formula = y ~ (smoker + age + dead)^2, family = poisson, 
    data = femsmoke)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.54284    0.58736   0.924 0.355384    
smokerno            -0.29666    0.25324  -1.171 0.241401    
age25-34             0.92902    0.68381   1.359 0.174273    
age35-44             1.94048    0.62486   3.105 0.001900 ** 
age45-54             2.76845    0.60657   4.564 5.02e-06 ***
age55-64             3.37507    0.59550   5.668 1.45e-08 ***
age65-74             2.86586    0.60894   4.706 2.52e-06 ***
age75+               2.02211    0.64955   3.113 0.001851 ** 
deadno               3.43271    0.59014   5.817 6.00e-09 ***
smokerno:age25-34    0.11752    0.22091   0.532 0.594749    
smokerno:age35-44    0.01268    0.22800   0.056 0.955654    
smokerno:age45-54   -0.56538    0.23585  -2.397 0.016522 *  
smokerno:age55-64    0.08512    0.23573   0.361 0.718030    
smokerno:age65-74    1.49088    0.30039   4.963 6.93e-07 ***
smokerno:age75+      1.89060    0.39582   4.776 1.78e-06 ***
smokerno:deadno      0.42741    0.17703   2.414 0.015762 *  
age25-34:deadno     -0.12006    0.68655  -0.175 0.861178    
age35-44:deadno     -1.34112    0.62857  -2.134 0.032874 *  
age45-54:deadno     -2.11336    0.61210  -3.453 0.000555 ***
age55-64:deadno     -3.18077    0.60057  -5.296 1.18e-07 ***
age65-74:deadno     -5.08798    0.61951  -8.213  < 2e-16 ***
age75+:deadno      -27.31727 8839.01146  -0.003 0.997534    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9378  on 27  degrees of freedom
Residual deviance:    2.3809  on  6  degrees of freedom
AIC: 180.58

Number of Fisher Scoring iterations: 18
  • If we compute the odds ratio for each age group
xtabs(fitted(modu) ~ smoker + dead + age, data = femsmoke) %>%
apply(3, function(x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
   18-24    25-34    35-44    45-54    55-64    65-74      75+ 
1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 

All seven fitted odds ratios are identical, hence the name uniform association model.

  • The common odds ratio can also be extracted from the smoker:dead interaction coefficient
exp(coef(modu)['smokerno:deadno'])
smokerno:deadno 
       1.533275 

3.5 Model selection

  • We can start from the saturated model (3-way interaction) and use the analysis of deviance to see how much we can reduce the model.
glm(y ~ smoker * age * dead, family = poisson, data = femsmoke) %>%
  step(test = "Chi") %>%
  drop1(test = "Chi")
Start:  AIC=190.19
y ~ smoker * age * dead

                  Df Deviance    AIC    LRT Pr(>Chi)
- smoker:age:dead  6   2.3809 180.58 2.3809   0.8815
<none>                 0.0000 190.19                

Step:  AIC=180.58
y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead

              Df Deviance    AIC    LRT Pr(>Chi)    
<none>               2.38 180.58                    
- smoker:dead  1     8.33 184.52   5.95  0.01475 *  
- smoker:age   6    92.63 258.83  90.25  < 2e-16 ***
- age:dead     6   632.30 798.49 629.92  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Single term deletions

Model:
y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
            Df Deviance    AIC    LRT Pr(>Chi)    
<none>             2.38 180.58                    
smoker:age   6    92.63 258.83  90.25  < 2e-16 ***
smoker:dead  1     8.33 184.52   5.95  0.01475 *  
age:dead     6   632.30 798.49 629.92  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction term smoker:dead is only weakly significant in the final model (p = 0.01475). This corresponds to the test of conditional independence between smoker and dead given age group. The p-value is very similar to that from the CMH test.

3.6 Binomial

  • It also makes intuitive sense to model life status as the outcome depending on the predictors smoker and age.
# glm(dead ~ smoker * age, family = binomial, weights = y, data = femsmoke) %>%
#   summary()
# rows 1-14 of femsmoke have dead == "yes" and rows 15-28 have dead == "no",
# so the two columns of matrix(femsmoke$y, ncol = 2) are (dead, alive) counts
glm(matrix(femsmoke$y, ncol = 2) ~ smoker * age, family = binomial, femsmoke[1:14, ]) %>%
  step(test = "Chi") %>%
  summary()
Start:  AIC=75
matrix(femsmoke$y, ncol = 2) ~ smoker * age

             Df Deviance    AIC    LRT Pr(>Chi)
- smoker:age  6   2.3809 65.377 2.3809   0.8815
<none>            0.0000 74.996                

Step:  AIC=65.38
matrix(femsmoke$y, ncol = 2) ~ smoker + age

         Df Deviance    AIC    LRT Pr(>Chi)    
<none>          2.38  65.38                    
- smoker  1     8.33  69.32   5.95  0.01475 *  
- age     6   632.30 683.29 629.92  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Call:
glm(formula = matrix(femsmoke$y, ncol = 2) ~ smoker + age, family = binomial, 
    data = femsmoke[1:14, ])

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.4327     0.5901  -5.817 6.00e-09 ***
smokerno       -0.4274     0.1770  -2.414 0.015762 *  
age25-34        0.1201     0.6865   0.175 0.861178    
age35-44        1.3411     0.6286   2.134 0.032874 *  
age45-54        2.1134     0.6121   3.453 0.000555 ***
age55-64        3.1808     0.6006   5.296 1.18e-07 ***
age65-74        5.0880     0.6195   8.213  < 2e-16 ***
age75+         27.8073 11293.1430   0.002 0.998035    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 641.4963  on 13  degrees of freedom
Residual deviance:   2.3809  on  6  degrees of freedom
AIC: 65.377

Number of Fisher Scoring iterations: 20
  • The final model retains only main effects. This is equivalent to the uniform association model; the deviance (2.38 on 6 degrees of freedom) is exactly the same.
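A minimal sketch verifying the equivalence numerically (modb is an assumed name for the final binomial fit; modu is the uniform association fit from above):

modb <- glm(matrix(femsmoke$y, ncol = 2) ~ smoker + age,
            family = binomial, data = femsmoke[1:14, ])
# the smoker effect on the log-odds of death reproduces the common odds ratio
exp(-coef(modb)["smokerno"])
exp(coef(modu)["smokerno:deadno"])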