Contingency Tables (ELMR Chapter 6)

Biostat 200C

Author

Dr. Jin Zhou @ UCLA

Published

May 4, 2025

Display system information and load tidyverse and faraway packages

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.1    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.4.1       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.49        jsonlite_1.9.1    xfun_0.51        
[13] digest_0.6.37     rlang_1.1.5       evaluate_1.0.3

library(tidyverse)
library(faraway)

faraway package contains the datasets in the ELMR book.

1 Two-by-two tables

Image source: https://www.gep.com/mind/blog/outlook-for-the-global-semiconductor-silicon-wafer-industry

Semiconductor wafer quality data. The research question is whether the presence of particles on the die affects the quality of the wafer being produced.

(wafer <- tibble(
  y        = c(320, 14, 80, 36),
  particle = gl(2, 1, length = 4, labels = c("no", "yes")),
  quality  = gl(2, 2, labels = c("good", "bad"))))

# A tibble: 4 × 3
      y particle quality
  <dbl> <fct>    <fct>  
1   320 no       good   
2    14 yes      good   
3    80 no       bad    
4    36 yes      bad

This kind of data is conveniently presented as a \(2 \times 2\) contingency table.

(ytable <- xtabs(y ~ particle + quality, data = wafer))

        quality
particle good bad
     no   320  80
     yes   14  36

The usual Pearson \(X^2\) test compares the test statistic \[ X^2 = \sum_{i, j} \frac{(y_{ij} - \widehat \mu_{ij})^2}{\widehat \mu_{ij}}, \] where \[ \widehat \mu_{ij} = n \widehat p_i \widehat p_j = n \frac{\sum_j y_{ij}}{n} \frac{\sum_i y_{ij}}{n}, \] to \(\chi_1^2\).

summary(ytable)

Call: xtabs(formula = y ~ particle + quality, data = wafer)
Number of cases in table: 450 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 62.81, df = 1, p-value = 2.274e-15

We will analyze this data using 4 different models according to 4 different sampling schemes. They all reach the same conclusion.
1. We observed the manufacturing process for a certain period of time and observed 450 wafers. The data were then cross-classified. We could use a Poisson model.
2. We decided to sample 450 wafers. The data were then cross-classified. We could use a multinomial model.
3. We selected 400 wafers without particles and 50 wafers with particles and then recorded the good or bad outcome. We could use a binomial model.
4. We selected 400 wafers without particles and 50 wafers with particles that also included, by design, 334 good wafers and 116 bad ones. We could use a hypergeometric model.

1.1 Poisson model

We observe the manufacturing process for a certain period of time and observe 450 wafers. The data are then cross-classified. We model the observed counts by Poisson.

library(gtsummary)
glm(y ~ particle + quality, family = poisson, data = wafer) %>%
  #tbl_regression()
  summary()


Call:
glm(formula = y ~ particle + quality, family = poisson, data = wafer)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   5.6934     0.0572  99.535   <2e-16 ***
particleyes  -2.0794     0.1500 -13.863   <2e-16 ***
qualitybad   -1.0575     0.1078  -9.813   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 474.10  on 3  degrees of freedom
Residual deviance:  54.03  on 1  degrees of freedom
AIC: 83.774

Number of Fisher Scoring iterations: 5

The null (intercept-only model) model assumes that the rate is constant across particle and quality levels. The analysis of deviance compares \(474.10 - 54.03\) to 2 degrees of freedom, indicating null model is rejected in favor of our two predictor Poisson model.
If we add an interaction term particle * quality, the model costs 4 parameters, the same number as the observations. It is equivalent to the full/saturated model. Analysis of deviance compares 54.03 on 1 degree of freedom, indicating the interaction term is highly significant.
Therefore our conclusion is that presence of particle in the die significantly affects the quality of wafer.

1.2 Multinomial model

If we assume the total sample size is fixed at \(n=450\). Then we can model the counts by a multinomial model. We consdier two models
- \(H_1\): one full/saturated model where each cell of the \(2 \times 2\) has its own probability \(p_{ij}\), and
- \(H_0\): one reduced model that assumes particle is independent of quality, thus \(p_{ij} = p_i p_j\).
For the full/saturated model \(H_1\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j p_{ij}^{y_{ij}}. \] We estimate \(p_{ij}\) by the cell proportion (MLE) \[ \widehat{p}_{ij} = \frac{y_{ij}}{n}. \]
For the reduced model assuming independence \(H_0\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j (p_i p_j)^{y_{ij}} = \frac{n!}{\prod_i \prod_j y_{ij}!} p_i^{\sum_j y_{ij}} p_j^{\sum_i y_{ij}}. \] The MLE is \[ \widehat{p}_i = \frac{\sum_j y_{ij}}{n}, \quad \widehat{p}_j = \frac{\sum_i y_{ij}}{n}. \]
The analysis deviance compares the deviance \[\begin{eqnarray*} D &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n} - 2 \sum_i \left(\sum_j y_{ij}\right) \log \widehat{p}_i - 2 \sum_j \left(\sum_i y_{ij}\right) \log \widehat{p}_j \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n \widehat{p}_i \widehat{p}_j} \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{\widehat{\mu}_{ij}} \end{eqnarray*}\] to 1 degree of freedom.

(partp <- xtabs(y ~ particle, data = wafer) %>% prop.table())

particle
       no       yes 
0.8888889 0.1111111

(qualp <- xtabs(y ~ quality, data = wafer)  %>% prop.table())

quality
     good       bad 
0.7422222 0.2577778

muhat <- 450 * outer(partp, qualp)
ytable <- xtabs(y ~ particle + quality, data = wafer)
2 * sum(ytable * log(ytable / muhat))

[1] 54.03045

We get the exact same result as the analysis of deviance in the Poisson model.
This connection between Poisson and multinomial is no surprise due to the following fact. If \(Y_1, \ldots, Y_k\) are independent Poisson with means \(\lambda_1, \ldots, \lambda_k\), then the joint distribution of \(Y_1, \ldots, Y_k \mid \sum_i Y_i = n\) is multinomial with probabilities \(p_j = \lambda_j / \sum_i \lambda_i\).

1.3 Binomial

If we view particle as a predictor affecting whether a wafer is good quality or bad quality, we end up with an independent binomial model.
The null (intercept-only) model corresponds to the hypothesis particle does not affect quality.

tibble(good     = c(320, 14),
       bad      = c(80, 36),
       particle = c("no", "yes")) %>%
  print() %>%
  glm(cbind(good, bad) ~ 1, family = binomial, data = .) %>%
  summary()

# A tibble: 2 × 3
   good   bad particle
  <dbl> <dbl> <chr>   
1   320    80 no      
2    14    36 yes


Call:
glm(formula = cbind(good, bad) ~ 1, family = binomial, data = .)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0576     0.1078   9.813   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 54.03  on 1  degrees of freedom
Residual deviance: 54.03  on 1  degrees of freedom
AIC: 66.191

Number of Fisher Scoring iterations: 3

The alternative model corresponds to the full/saturated model where particle is included as a predictor.

Again we observe the exactly same analysis of deviance inference.
When there are more than two rows or columns, this model is called the product binoimal/multinomial model.

1.4 Hypergeometric

Finally if we fix both row and column marginal totals, the probability of the observed table is \[\begin{eqnarray*} p(y_{11}, y_{12}, y_{21}, y_{22}) &=& \frac{\binom{y_{1\cdot}}{y_{11}} \binom{y_{2\cdot}}{y_{22}} \binom{y_{\cdot 1}}{y_{21}} \binom{y_{\cdot 2}}{y_{12}}}{\binom{n}{y_{11} \, y_{12} \, y_{21} \, y_{22}}} \\ &=& \frac{y_{1\cdot}! y_{2\cdot}! y_{\cdot 1}! y_{\cdot 2}!}{y_{11}! y_{12}! y_{21}! y_{22}! n!} \\ &=& \frac{\binom{y_{1 \cdot}}{y_{11}} \binom{y_{2 \cdot}}{y_{\cdot 1} - y_{11}}}{\binom{n}{y_{\cdot 1}}}. \end{eqnarray*}\]
Under the null hypothesis that particle is not associated with quality, the Fisher’s exact test calculates the p-value by summing over the probabilities of tables with more extreme observations \[ \sum_{y_{11} \ge 320} p(y_{11}, y_{12}, y_{21}, y_{22}). \]

fisher.test(ytable)


    Fisher's Exact Test for Count Data

data:  ytable
p-value = 2.955e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  5.090628 21.544071
sample estimates:
odds ratio 
  10.21331

The odds ratio is \[ \frac{\pi_{\text{no particle}} / (1 - \pi_{\text{no particle}})}{\pi_{\text{particle}} / (1 - \pi_{\text{particle}})} = \frac{y_{11} y_{22}}{y_{12} y_{21}} \]

(320 * 36) / (14 * 80)

[1] 10.28571

2 Larger two-way tables

The haireye data set contains data on 592 statistics students cross-classifed by hair and eye color.

haireye <- as_tibble(haireye) %>%
  print(n = Inf)

# A tibble: 16 × 3
       y eye   hair 
   <dbl> <fct> <fct>
 1     5 green BLACK
 2    29 green BROWN
 3    14 green RED  
 4    16 green BLOND
 5    15 hazel BLACK
 6    54 hazel BROWN
 7    14 hazel RED  
 8    10 hazel BLOND
 9    20 blue  BLACK
10    84 blue  BROWN
11    17 blue  RED  
12    94 blue  BLOND
13    68 brown BLACK
14   119 brown BROWN
15    26 brown RED  
16     7 brown BLOND

(haireye_table <- xtabs(y ~ hair + eye, data = haireye))

       eye
hair    green hazel blue brown
  BLACK     5    15   20    68
  BROWN    29    54   84   119
  RED      14    14   17    26
  BLOND    16    10   94     7

Graphical summary of the contingency table by a mosaic plot. If eye and hair are independent, we expect to see a grid.

mosaicplot(haireye_table, color = TRUE, main = NULL, las = 1)

Pearson \(X^2\) test for independence yields

summary(haireye_table)

Call: xtabs(formula = y ~ hair + eye, data = haireye)
Number of cases in table: 592 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 138.29, df = 9, p-value = 2.325e-25

We fit a Poisson GLM:

    modc <- glm(y ~ hair + eye, family = poisson, data = haireye)
    summary(modc)


Call:
glm(formula = y ~ hair + eye, family = poisson, data = haireye)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.4575     0.1523  16.136  < 2e-16 ***
hairBROWN     0.9739     0.1129   8.623  < 2e-16 ***
hairRED      -0.4195     0.1528  -2.745  0.00604 ** 
hairBLOND     0.1621     0.1309   1.238  0.21569    
eyehazel      0.3737     0.1624   2.301  0.02139 *  
eyeblue       1.2118     0.1424   8.510  < 2e-16 ***
eyebrown      1.2347     0.1420   8.694  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 453.31  on 15  degrees of freedom
Residual deviance: 146.44  on  9  degrees of freedom
AIC: 241.04

Number of Fisher Scoring iterations: 5

which clearly shows a lack of fit. The interaction model is equivalent to the full/saturated model. Therefore we see strong evidence for the dependence between eye and hair.

3 Three-way contingency tables

The femsoke data records female smokers and non-smokers, their age group, and whether the subjects were dead or still alive after 20 years.

femsmoke <- as_tibble(femsmoke) %>%
  print(n = Inf)

# A tibble: 28 × 4
       y smoker dead  age  
   <dbl> <fct>  <fct> <fct>
 1     2 yes    yes   18-24
 2     1 no     yes   18-24
 3     3 yes    yes   25-34
 4     5 no     yes   25-34
 5    14 yes    yes   35-44
 6     7 no     yes   35-44
 7    27 yes    yes   45-54
 8    12 no     yes   45-54
 9    51 yes    yes   55-64
10    40 no     yes   55-64
11    29 yes    yes   65-74
12   101 no     yes   65-74
13    13 yes    yes   75+  
14    64 no     yes   75+  
15    53 yes    no    18-24
16    61 no     no    18-24
17   121 yes    no    25-34
18   152 no     no    25-34
19    95 yes    no    35-44
20   114 no     no    35-44
21   103 yes    no    45-54
22    66 no     no    45-54
23    64 yes    no    55-64
24    81 no     no    55-64
25     7 yes    no    65-74
26    28 no     no    65-74
27     0 yes    no    75+  
28     0 no     no    75+

There are three factors smoker, dead, and age. If we just classify over smoker and dead

(ct <- xtabs(y ~ smoker + dead, data = femsmoke))

      dead
smoker yes  no
   yes 139 443
   no  230 502

prop.table(ct, 1)

      dead
smoker       yes        no
   yes 0.2388316 0.7611684
   no  0.3142077 0.6857923

we see significantly higher percentage of non-smokers died than smokers.

summary(ct)

Call: xtabs(formula = y ~ smoker + dead, data = femsmoke)
Number of cases in table: 1314 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 9.121, df = 1, p-value = 0.002527

However smoke status is confounded with the age group. Smokers are more concentrated in the younger groups and younger people are more likely to live for another 20 years.

prop.table(xtabs(y ~ smoker + age, data = femsmoke), 2)

      age
smoker     18-24     25-34     35-44     45-54     55-64     65-74       75+
   yes 0.4700855 0.4412811 0.4739130 0.6250000 0.4872881 0.2181818 0.1688312
   no  0.5299145 0.5587189 0.5260870 0.3750000 0.5127119 0.7818182 0.8311688

The odds ratios in all age groups are:

ct3 <- xtabs(y ~ smoker + dead + age, data = femsmoke)
apply(ct3, 3, function (x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))

   18-24    25-34    35-44    45-54    55-64    65-74      75+ 
2.301887 0.753719 2.400000 1.441748 1.613672 1.148515      NaN

The Cochran-Mantel-Haenszel (CMH) tests independence in \(2 \times 2\) tables across \(K\) strata. It compares the statistic \[ \frac{(|\sum_k y_{11k} - \sum_k \mathbb{E}y_{11k}| - 1/2)^2}{\sum_k \operatorname{Var} y_{11k}}, \] where the expectation and variance are computed under the null hypothesis of independence in each stratum, to asymptotic null distribution \(\chi_1^2\) and calculate the p-value by exact test.

mantelhaen.test(ct3, exact = TRUE)


    Exact conditional test of independence in 2 x 2 x k tables

data:  ct3
S = 139, p-value = 0.01591
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 1.068889 2.203415
sample estimates:
common odds ratio 
         1.530256

3.1 Mutual independence model

Under mutual independence, \[ p_{ijk} = p_i p_j p_k \] so \[ \mathbb{E} Y_{ijk} = n p_{ijk} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k. \] So the main effect-only model corresponds to the mutual independence model.
We can test independence by the Pearson \(X^2\) test

summary(ct3)

Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
Number of cases in table: 1314 
Number of factors: 3 
Test for independence of all factors:
    Chisq = 790.6, df = 19, p-value = 2.14e-155

or by a Poisson GLM

modi <- glm(y ~ smoker + dead + age, family = poisson, data = femsmoke)
summary(modi)


Call:
glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.67778    0.10702  25.021  < 2e-16 ***
smokerno     0.22931    0.05554   4.129 3.64e-05 ***
deadno       0.94039    0.06139  15.319  < 2e-16 ***
age25-34     0.87618    0.11003   7.963 1.67e-15 ***
age35-44     0.67591    0.11356   5.952 2.65e-09 ***
age45-54     0.57536    0.11556   4.979 6.40e-07 ***
age55-64     0.70166    0.11307   6.206 5.45e-10 ***
age65-74     0.34377    0.12086   2.844  0.00445 ** 
age75+      -0.41837    0.14674  -2.851  0.00436 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9  on 27  degrees of freedom
Residual deviance:  735.0  on 19  degrees of freedom
AIC: 887.2

Number of Fisher Scoring iterations: 6

Both suggest a lack of fit.

modi %>% tbl_regression(exponentiate = TRUE, intercept = TRUE)

Characteristic	IRR	95% CI	p-value
(Intercept)	14.6	11.7, 17.9	<0.001
smoker
yes	—	—
no	1.26	1.13, 1.40	<0.001
dead
yes	—	—
no	2.56	2.27, 2.89	<0.001
age
18-24	—	—
25-34	2.40	1.94, 2.99	<0.001
35-44	1.97	1.58, 2.46	<0.001
45-54	1.78	1.42, 2.24	<0.001
55-64	2.02	1.62, 2.53	<0.001
65-74	1.41	1.11, 1.79	0.004
75+	0.66	0.49, 0.88	0.004
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

The exponentiated coefficients are just marginal proportions

c(1, 1.26) / (1 + 1.26)

[1] 0.4424779 0.5575221

prop.table(xtabs(y ~ smoker, femsmoke))

smoker
      yes        no 
0.4429224 0.5570776

3.2 Joint independence

If we assume the first two variables are dependent, and jointly independent of the third. Then \[ p_{ijk} = p_{ij} p_k \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ij} + \log p_k. \]
It leads to a log-linear model with just one interaction

    glm(y ~ smoker * dead + age, family = poisson, data = femsmoke) %>%
    summary()


Call:
glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      2.51582    0.12239  20.555  < 2e-16 ***
smokerno         0.50361    0.10743   4.688 2.76e-06 ***
deadno           1.15910    0.09722  11.922  < 2e-16 ***
age25-34         0.87618    0.11003   7.963 1.67e-15 ***
age35-44         0.67591    0.11356   5.952 2.65e-09 ***
age45-54         0.57536    0.11556   4.979 6.40e-07 ***
age55-64         0.70166    0.11307   6.206 5.45e-10 ***
age65-74         0.34377    0.12086   2.844  0.00445 ** 
age75+          -0.41837    0.14674  -2.851  0.00436 ** 
smokerno:deadno -0.37858    0.12566  -3.013  0.00259 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9  on 27  degrees of freedom
Residual deviance:  725.8  on 18  degrees of freedom
AIC: 880

Number of Fisher Scoring iterations: 6

It improves the fit, but still not enough.

3.3 Conditional independence

Let \(P_{ij \mid k}\) be the probability that an observation falls in \((i, j)\)-cell given that we know the third variables takes value \(k\). If we assume that first two variables are independent give value of the third. Then \[ p_{ijk} = p_{ij\mid k} p_k = p_{i\mid k} p_{j \mid k} p_k = \frac{p_{ik} p_{jk}}{p_k}. \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ik} + \log p_{jk} - \log p_k. \]
It leads to a model with two interaction terms

glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
  summary()


Call:
glm(formula = y ~ smoker * age + dead * age, family = poisson, 
    data = femsmoke)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.34377    0.58563   0.587 0.557199    
smokerno             0.11980    0.18523   0.647 0.517785    
age25-34             0.91760    0.68737   1.335 0.181895    
age35-44             1.95402    0.62882   3.107 0.001887 ** 
age45-54             2.84979    0.60950   4.676 2.93e-06 ***
age55-64             3.44819    0.59868   5.760 8.43e-09 ***
age65-74             3.00134    0.61023   4.918 8.73e-07 ***
age75+               2.22118    0.64799   3.428 0.000609 ***
deadno               3.63759    0.58490   6.219 5.00e-10 ***
smokerno:age25-34    0.11616    0.22078   0.526 0.598789    
smokerno:age35-44   -0.01536    0.22749  -0.068 0.946172    
smokerno:age45-54   -0.63063    0.23414  -2.693 0.007074 ** 
smokerno:age55-64   -0.06894    0.22643  -0.304 0.760765    
smokerno:age65-74    1.15649    0.26427   4.376 1.21e-05 ***
smokerno:age75+      1.47413    0.35617   4.139 3.49e-05 ***
age25-34:deadno     -0.10756    0.68613  -0.157 0.875435    
age35-44:deadno     -1.33977    0.62810  -2.133 0.032920 *  
age45-54:deadno     -2.17125    0.61128  -3.552 0.000382 ***
age55-64:deadno     -3.17171    0.59999  -5.286 1.25e-07 ***
age65-74:deadno     -4.94977    0.61512  -8.047 8.49e-16 ***
age75+:deadno      -26.30450 5776.51889  -0.005 0.996367    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9378  on 27  degrees of freedom
Residual deviance:    8.3269  on  7  degrees of freedom
AIC: 184.52

Number of Fisher Scoring iterations: 17

This model fits data pretty well.

Significance of predictors

glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
drop1(test = "Chi")

Single term deletions

Model:
y ~ smoker * age + dead * age
           Df Deviance    AIC    LRT  Pr(>Chi)    
<none>            8.33 184.52                     
smoker:age  6   101.83 266.03  93.51 < 2.2e-16 ***
age:dead    6   641.50 805.69 633.17 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4 Uniform association

If we consider a model with all two-way interactions \[ \log \mathbb{E}Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k + \log p_{ij} + \log p_{ik} + \log p_{jk}. \]
There is no simple interpretation in terms of independence.

modu <- glm(y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)
summary(modu)


Call:
glm(formula = y ~ (smoker + age + dead)^2, family = poisson, 
    data = femsmoke)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.54284    0.58736   0.924 0.355384    
smokerno            -0.29666    0.25324  -1.171 0.241401    
age25-34             0.92902    0.68381   1.359 0.174273    
age35-44             1.94048    0.62486   3.105 0.001900 ** 
age45-54             2.76845    0.60657   4.564 5.02e-06 ***
age55-64             3.37507    0.59550   5.668 1.45e-08 ***
age65-74             2.86586    0.60894   4.706 2.52e-06 ***
age75+               2.02211    0.64955   3.113 0.001851 ** 
deadno               3.43271    0.59014   5.817 6.00e-09 ***
smokerno:age25-34    0.11752    0.22091   0.532 0.594749    
smokerno:age35-44    0.01268    0.22800   0.056 0.955654    
smokerno:age45-54   -0.56538    0.23585  -2.397 0.016522 *  
smokerno:age55-64    0.08512    0.23573   0.361 0.718030    
smokerno:age65-74    1.49088    0.30039   4.963 6.93e-07 ***
smokerno:age75+      1.89060    0.39582   4.776 1.78e-06 ***
smokerno:deadno      0.42741    0.17703   2.414 0.015762 *  
age25-34:deadno     -0.12006    0.68655  -0.175 0.861178    
age35-44:deadno     -1.34112    0.62857  -2.134 0.032874 *  
age45-54:deadno     -2.11336    0.61210  -3.453 0.000555 ***
age55-64:deadno     -3.18077    0.60057  -5.296 1.18e-07 ***
age65-74:deadno     -5.08798    0.61951  -8.213  < 2e-16 ***
age75+:deadno      -27.31727 8839.01146  -0.003 0.997534    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1193.9378  on 27  degrees of freedom
Residual deviance:    2.3809  on  6  degrees of freedom
AIC: 180.58

Number of Fisher Scoring iterations: 18

If we compute the odds ratio for each age group

xtabs(fitted(modu) ~ smoker + dead + age, data = femsmoke) %>%
apply(3, function(x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))

   18-24    25-34    35-44    45-54    55-64    65-74      75+ 
1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275

Thus the name uniform association model.

The odds ratio can also be extracted from the coefficients

exp(coef(modu)['smokerno:deadno'])

smokerno:deadno 
       1.533275

3.5 Model selection

We can start from the saturated model (3-way interaction) and use the analysis of deviance to see how much we can reduce the model.

glm(y ~ smoker * age * dead, family = poisson, data = femsmoke) %>%
  step(test = "Chi") %>%
  drop1(test = "Chi")

Start:  AIC=190.19
y ~ smoker * age * dead

                  Df Deviance    AIC    LRT Pr(>Chi)
- smoker:age:dead  6   2.3809 180.58 2.3809   0.8815
<none>                 0.0000 190.19                

Step:  AIC=180.58
y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead

              Df Deviance    AIC    LRT Pr(>Chi)    
<none>               2.38 180.58                    
- smoker:dead  1     8.33 184.52   5.95  0.01475 *  
- smoker:age   6    92.63 258.83  90.25  < 2e-16 ***
- age:dead     6   632.30 798.49 629.92  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Single term deletions

Model:
y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
            Df Deviance    AIC    LRT Pr(>Chi)    
<none>             2.38 180.58                    
smoker:age   6    92.63 258.83  90.25  < 2e-16 ***
smoker:dead  1     8.33 184.52   5.95  0.01475 *  
age:dead     6   632.30 798.49 629.92  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction term smoker:dead is weakly significant in the final model. This corresponds to the test of conditional independence between smoke and dead given age group. The p-value is very similar to that from the CMH test.

3.6 Binomial

It also makes intuitive sense to model life status as the outcome depending on perdictors smoker and age.

# glm(dead ~ smoker * age, family = binomial, weights = y, data = femsmoke) %>%
#   summary()
glm(matrix(femsmoke$y, ncol = 2) ~ smoker * age, family = binomial, femsmoke[1:14, ]) %>%
  step(test = "Chi") %>%
  summary()

Start:  AIC=75
matrix(femsmoke$y, ncol = 2) ~ smoker * age

             Df Deviance    AIC    LRT Pr(>Chi)
- smoker:age  6   2.3809 65.377 2.3809   0.8815
<none>            0.0000 74.996                

Step:  AIC=65.38
matrix(femsmoke$y, ncol = 2) ~ smoker + age

         Df Deviance    AIC    LRT Pr(>Chi)    
<none>          2.38  65.38                    
- smoker  1     8.33  69.32   5.95  0.01475 *  
- age     6   632.30 683.29 629.92  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Call:
glm(formula = matrix(femsmoke$y, ncol = 2) ~ smoker + age, family = binomial, 
    data = femsmoke[1:14, ])

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.4327     0.5901  -5.817 6.00e-09 ***
smokerno       -0.4274     0.1770  -2.414 0.015762 *  
age25-34        0.1201     0.6865   0.175 0.861178    
age35-44        1.3411     0.6286   2.134 0.032874 *  
age45-54        2.1134     0.6121   3.453 0.000555 ***
age55-64        3.1808     0.6006   5.296 1.18e-07 ***
age65-74        5.0880     0.6195   8.213  < 2e-16 ***
age75+         27.8073 11293.1430   0.002 0.998035    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 641.4963  on 13  degrees of freedom
Residual deviance:   2.3809  on  6  degrees of freedom
AIC: 65.377

Number of Fisher Scoring iterations: 20

The final model retains only main effects. This is equivalent to the uniform association model. The deviance is exactly the same.