Semiconductor wafer quality data. The research question is whether the presence of particles on the die affects the quality of the wafer being produced.
# A tibble: 4 × 3
y particle quality
<dbl> <fct> <fct>
1 320 no good
2 14 yes good
3 80 no bad
4 36 yes bad
This kind of data is conveniently presented as a \(2 \times 2\) contingency table.
(ytable <- xtabs(y ~ particle + quality, data = wafer))
quality
particle good bad
no 320 80
yes 14 36
The usual Pearson \(X^2\) test compares the test statistic \[
X^2 = \sum_{i, j} \frac{(y_{ij} - \widehat \mu_{ij})^2}{\widehat \mu_{ij}},
\] where \[
\widehat \mu_{ij} = n \widehat p_i \widehat p_j = n \frac{\sum_j y_{ij}}{n} \frac{\sum_i y_{ij}}{n},
\] to \(\chi_1^2\).
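To make the statistic concrete, here is a direct computation from the counts above (a sketch; the table is re-entered as a plain matrix):

```r
# Re-enter the 2 x 2 table: rows = particle (no/yes), columns = quality (good/bad)
y <- matrix(c(320, 80, 14, 36), nrow = 2, byrow = TRUE)
n <- sum(y)
# Fitted counts under independence: mu_ij = n * phat_i * phat_j
mu <- outer(rowSums(y), colSums(y)) / n
X2 <- sum((y - mu)^2 / mu)
X2                                      # 62.81, matching the summary(ytable) output
pchisq(X2, df = 1, lower.tail = FALSE)  # p-value
```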
summary(ytable)
Call: xtabs(formula = y ~ particle + quality, data = wafer)
Number of cases in table: 450
Number of factors: 2
Test for independence of all factors:
Chisq = 62.81, df = 1, p-value = 2.274e-15
We will analyze these data using four different models corresponding to four different sampling schemes. They all reach the same conclusion.
We observed the manufacturing process for a certain period of time and observed 450 wafers. The data were then cross-classified. We could use a Poisson model.
We decided to sample 450 wafers. The data were then cross-classified. We could use a multinomial model.
We selected 400 wafers without particles and 50 wafers with particles and then recorded the good or bad outcome. We could use a binomial model.
We selected 400 wafers without particles and 50 wafers with particles that also included, by design, 334 good wafers and 116 bad ones. We could use a hypergeometric model.
1.1 Poisson model
We observe the manufacturing process for a certain period of time and observe 450 wafers. The data are then cross-classified. We model the observed counts by Poisson.
library(gtsummary)
glm(y ~ particle + quality, family = poisson, data = wafer) %>%
  # tbl_regression()
  summary()
Call:
glm(formula = y ~ particle + quality, family = poisson, data = wafer)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.6934 0.0572 99.535 <2e-16 ***
particleyes -2.0794 0.1500 -13.863 <2e-16 ***
qualitybad -1.0575 0.1078 -9.813 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 474.10 on 3 degrees of freedom
Residual deviance: 54.03 on 1 degrees of freedom
AIC: 83.774
Number of Fisher Scoring iterations: 5
The null (intercept-only) model assumes that the rate is constant across particle and quality levels. The analysis of deviance compares \(474.10 - 54.03 = 420.07\) to \(\chi_2^2\), so the null model is decisively rejected in favor of the two-predictor Poisson model.
If we add the interaction term particle * quality, the model has 4 parameters, the same as the number of observations, so it is equivalent to the full/saturated model. The analysis of deviance compares 54.03 to \(\chi_1^2\), indicating the interaction term is highly significant.
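Both comparisons can be reproduced with anova(); this sketch re-enters the four counts as a plain data frame in place of wafer:

```r
wafer <- data.frame(
  y        = c(320, 14, 80, 36),
  particle = c("no", "yes", "no", "yes"),
  quality  = c("good", "good", "bad", "bad")
)
mod0 <- glm(y ~ 1, family = poisson, data = wafer)                  # null
mod1 <- glm(y ~ particle + quality, family = poisson, data = wafer) # main effects
mod2 <- glm(y ~ particle * quality, family = poisson, data = wafer) # saturated
anova(mod0, mod1, mod2, test = "Chisq")
```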
Therefore we conclude that the presence of particles on the die significantly affects the quality of the wafer.
1.2 Multinomial model
Now suppose the total sample size is fixed at \(n=450\). Then we can model the counts by a multinomial model. We consider two models:
\(H_1\): one full/saturated model where each cell of the \(2 \times 2\) has its own probability \(p_{ij}\), and
\(H_0\): one reduced model that assumes particle is independent of quality, thus \(p_{ij} = p_i p_j\).
For the full/saturated model \(H_1\), the multinomial likelihood is \[
\frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j p_{ij}^{y_{ij}}.
\] We estimate \(p_{ij}\) by the cell proportion (MLE) \[
\widehat{p}_{ij} = \frac{y_{ij}}{n}.
\]
For the reduced model assuming independence \(H_0\), the multinomial likelihood is \[
\frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j (p_i p_j)^{y_{ij}} = \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i p_i^{\sum_j y_{ij}} \prod_j p_j^{\sum_i y_{ij}}.
\] The MLE is \[
\widehat{p}_i = \frac{\sum_j y_{ij}}{n}, \quad \widehat{p}_j = \frac{\sum_i y_{ij}}{n}.
\]
We get the exact same result as the analysis of deviance in the Poisson model.
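The likelihood-ratio statistic \(G^2 = 2 \sum_{i,j} y_{ij} \log(y_{ij}/\widehat\mu_{ij})\) comparing \(H_0\) to \(H_1\) can be computed directly from the MLEs above (a sketch with the counts re-entered):

```r
y <- matrix(c(320, 80, 14, 36), nrow = 2, byrow = TRUE)
n <- sum(y)
mu <- outer(rowSums(y), colSums(y)) / n   # n * phat_i * phat_j under independence
G2 <- 2 * sum(y * log(y / mu))
G2                                        # 54.03, the Poisson residual deviance
pchisq(G2, df = 1, lower.tail = FALSE)
```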
This connection between Poisson and multinomial is no surprise due to the following fact. If \(Y_1, \ldots, Y_k\) are independent Poisson with means \(\lambda_1, \ldots, \lambda_k\), then the joint distribution of \(Y_1, \ldots, Y_k \mid \sum_i Y_i = n\) is multinomial with probabilities \(p_j = \lambda_j / \sum_i \lambda_i\).
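A quick simulation illustrates this fact; the rates \(\lambda = (2, 3, 5)\) here are arbitrary choices for the sketch. Conditional on the total, the Poisson counts behave like a multinomial draw with cell means \(n \lambda_j / \sum_i \lambda_i\):

```r
set.seed(1)
lambda <- c(2, 3, 5)
Y <- matrix(rpois(3 * 1e5, lambda), nrow = 3)  # each column: independent Poissons
keep <- colSums(Y) == 10                       # condition on the total n = 10
rowMeans(Y[, keep])  # close to 10 * lambda / sum(lambda) = (2, 3, 5)
```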
1.3 Binomial
If we view particle as a predictor affecting whether a wafer is good quality or bad quality, we end up with an independent binomial model.
The null (intercept-only) model corresponds to the hypothesis particle does not affect quality.
tibble(
  good = c(320, 14),
  bad = c(80, 36),
  particle = c("no", "yes")
) %>%
  print() %>%
  glm(cbind(good, bad) ~ 1, family = binomial, data = .) %>%
  summary()
# A tibble: 2 × 3
good bad particle
<dbl> <dbl> <chr>
1 320 80 no
2 14 36 yes
Call:
glm(formula = cbind(good, bad) ~ 1, family = binomial, data = .)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0576 0.1078 9.813 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 54.03 on 1 degrees of freedom
Residual deviance: 54.03 on 1 degrees of freedom
AIC: 66.191
Number of Fisher Scoring iterations: 3
The alternative model corresponds to the full/saturated model where particle is included as a predictor.
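Fitting that model and comparing it to the intercept-only fit reproduces the same deviance drop (a sketch re-entering the counts as a plain data frame):

```r
wafer2 <- data.frame(good = c(320, 14), bad = c(80, 36), particle = c("no", "yes"))
mod_null <- glm(cbind(good, bad) ~ 1, family = binomial, data = wafer2)
mod_part <- glm(cbind(good, bad) ~ particle, family = binomial, data = wafer2)
anova(mod_null, mod_part, test = "Chisq")  # deviance drop 54.03 on 1 df
```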
Again we observe exactly the same analysis of deviance inference.
When there are more than two rows or columns, this model is called the product binomial/multinomial model.
1.4 Hypergeometric
Finally if we fix both row and column marginal totals, the probability of the observed table is \[\begin{eqnarray*}
p(y_{11}, y_{12}, y_{21}, y_{22}) &=& \frac{\binom{y_{1\cdot}}{y_{11}} \binom{y_{2\cdot}}{y_{22}} \binom{y_{\cdot 1}}{y_{21}} \binom{y_{\cdot 2}}{y_{12}}}{\binom{n}{y_{11} \, y_{12} \, y_{21} \, y_{22}}} \\
&=& \frac{y_{1\cdot}! y_{2\cdot}! y_{\cdot 1}! y_{\cdot 2}!}{y_{11}! y_{12}! y_{21}! y_{22}! n!} \\
&=& \frac{\binom{y_{1 \cdot}}{y_{11}} \binom{y_{2 \cdot}}{y_{\cdot 1} - y_{11}}}{\binom{n}{y_{\cdot 1}}}.
\end{eqnarray*}\]
Under the null hypothesis that particle is not associated with quality, Fisher's exact test calculates a one-sided p-value by summing the probabilities of tables at least as extreme as the one observed \[
\sum_{y_{11} \ge 320} p(y_{11}, y_{12}, y_{21}, y_{22}).
\]
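This tail sum can be evaluated with the hypergeometric density; note that fisher.test below reports a two-sided p-value by default, summing all tables whose probability is no larger than the observed one. A sketch:

```r
# P(y11 >= 320) with margins fixed at 400/50 (particle) and 334/116 (quality)
sum(dhyper(320:334, m = 400, n = 50, k = 334))
```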
fisher.test(ytable)
Fisher's Exact Test for Count Data
data: ytable
p-value = 2.955e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
5.090628 21.544071
sample estimates:
odds ratio
10.21331
The odds ratio \[
\frac{\pi_{\text{no particle}} / (1 - \pi_{\text{no particle}})}{\pi_{\text{particle}} / (1 - \pi_{\text{particle}})},
\] where \(\pi\) denotes the probability of a good wafer in each group, is estimated by the sample odds ratio \(y_{11} y_{22} / (y_{12} y_{21})\):
(320*36) / (14*80)
[1] 10.28571
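The estimate reported by fisher.test (10.21) is the conditional maximum likelihood estimate, which differs slightly from the sample odds ratio above. A large-sample Wald confidence interval can be sketched from the standard error of the log odds ratio, \(\sqrt{\sum_{ij} 1/y_{ij}}\):

```r
or <- (320 * 36) / (14 * 80)            # sample odds ratio
se <- sqrt(1/320 + 1/80 + 1/14 + 1/36)  # Wald SE of log(or)
exp(log(or) + c(-1, 1) * qnorm(0.975) * se)  # roughly (5.3, 20.0)
```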
2 Larger two-way tables
The haireye data set contains data on 592 statistics students cross-classified by hair and eye color.
haireye <- as_tibble(haireye) %>% print(n = Inf)
# A tibble: 16 × 3
y eye hair
<dbl> <fct> <fct>
1 5 green BLACK
2 29 green BROWN
3 14 green RED
4 16 green BLOND
5 15 hazel BLACK
6 54 hazel BROWN
7 14 hazel RED
8 10 hazel BLOND
9 20 blue BLACK
10 84 blue BROWN
11 17 blue RED
12 94 blue BLOND
13 68 brown BLACK
14 119 brown BROWN
15 26 brown RED
16 7 brown BLOND
(haireye_table <- xtabs(y ~ hair + eye, data = haireye))
eye
hair green hazel blue brown
BLACK 5 15 20 68
BROWN 29 54 84 119
RED 14 14 17 26
BLOND 16 10 94 7
A mosaic plot gives a graphical summary of the contingency table. If eye and hair were independent, we would expect the tiles to line up in a regular grid.
mosaicplot(haireye_table, color = TRUE, main = NULL, las = 1)
The Pearson \(X^2\) test for independence yields
summary(haireye_table)
Call: xtabs(formula = y ~ hair + eye, data = haireye)
Number of cases in table: 592
Number of factors: 2
Test for independence of all factors:
Chisq = 138.29, df = 9, p-value = 2.325e-25
We fit a Poisson GLM:
modc <- glm(y ~ hair + eye, family = poisson, data = haireye)
summary(modc)
Call:
glm(formula = y ~ hair + eye, family = poisson, data = haireye)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.4575 0.1523 16.136 < 2e-16 ***
hairBROWN 0.9739 0.1129 8.623 < 2e-16 ***
hairRED -0.4195 0.1528 -2.745 0.00604 **
hairBLOND 0.1621 0.1309 1.238 0.21569
eyehazel 0.3737 0.1624 2.301 0.02139 *
eyeblue 1.2118 0.1424 8.510 < 2e-16 ***
eyebrown 1.2347 0.1420 8.694 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 453.31 on 15 degrees of freedom
Residual deviance: 146.44 on 9 degrees of freedom
AIC: 241.04
Number of Fisher Scoring iterations: 5
which clearly shows a lack of fit: the residual deviance 146.44 is far too large for \(\chi_9^2\). Since adding the hair:eye interaction would give the full/saturated model, this lack of fit is strong evidence of dependence between hair and eye color.
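To see which cells drive the dependence, we can inspect the Pearson residuals of the independence fit; this sketch re-enters the counts from the table above as a plain data frame:

```r
haireye <- data.frame(
  y = c(5, 29, 14, 16, 15, 54, 14, 10, 20, 84, 17, 94, 68, 119, 26, 7),
  eye = rep(c("green", "hazel", "blue", "brown"), each = 4),
  hair = rep(c("BLACK", "BROWN", "RED", "BLOND"), times = 4)
)
modc <- glm(y ~ hair + eye, family = poisson, data = haireye)
# Large positive/negative residuals mark over/under-represented combinations
round(xtabs(resid(modc, type = "pearson") ~ hair + eye, data = haireye), 1)
```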
3 Three-way contingency tables
The femsmoke data records female smokers and non-smokers, their age group, and whether the subjects were dead or still alive after 20 years.
femsmoke <- as_tibble(femsmoke) %>% print(n = Inf)
# A tibble: 28 × 4
y smoker dead age
<dbl> <fct> <fct> <fct>
1 2 yes yes 18-24
2 1 no yes 18-24
3 3 yes yes 25-34
4 5 no yes 25-34
5 14 yes yes 35-44
6 7 no yes 35-44
7 27 yes yes 45-54
8 12 no yes 45-54
9 51 yes yes 55-64
10 40 no yes 55-64
11 29 yes yes 65-74
12 101 no yes 65-74
13 13 yes yes 75+
14 64 no yes 75+
15 53 yes no 18-24
16 61 no no 18-24
17 121 yes no 25-34
18 152 no no 25-34
19 95 yes no 35-44
20 114 no no 35-44
21 103 yes no 45-54
22 66 no no 45-54
23 64 yes no 55-64
24 81 no no 55-64
25 7 yes no 65-74
26 28 no no 65-74
27 0 yes no 75+
28 0 no no 75+
There are three factors smoker, dead, and age. If we just classify over smoker and dead
(ct <- xtabs(y ~ smoker + dead, data = femsmoke))
dead
smoker yes no
yes 139 443
no 230 502
prop.table(ct, 1)
dead
smoker yes no
yes 0.2388316 0.7611684
no 0.3142077 0.6857923
we see that a significantly higher percentage of non-smokers died than smokers.
summary(ct)
Call: xtabs(formula = y ~ smoker + dead, data = femsmoke)
Number of cases in table: 1314
Number of factors: 2
Test for independence of all factors:
Chisq = 9.121, df = 1, p-value = 0.002527
However, smoking status is confounded with age group: smokers are concentrated in the younger age groups, and younger people are more likely to live for another 20 years.
prop.table(xtabs(y ~ smoker + age, data = femsmoke), 2)
The Cochran-Mantel-Haenszel (CMH) test assesses independence in \(2 \times 2\) tables across \(K\) strata. It compares the statistic \[
\frac{(|\sum_k y_{11k} - \sum_k \mathbb{E}y_{11k}| - 1/2)^2}{\sum_k \operatorname{Var} y_{11k}},
\] where the expectation and variance are computed under the null hypothesis of independence within each stratum, to its asymptotic null distribution \(\chi_1^2\). With exact = TRUE, R instead computes an exact conditional p-value.
ct3 <- xtabs(y ~ smoker + dead + age, data = femsmoke)
mantelhaen.test(ct3, exact = TRUE)
Exact conditional test of independence in 2 x 2 x k tables
data: ct3
S = 139, p-value = 0.01591
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
1.068889 2.203415
sample estimates:
common odds ratio
1.530256
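To connect the output with the formula above, the continuity-corrected CMH statistic can also be computed by hand. This sketch re-enters the femsmoke counts so it is self-contained, and matches mantelhaen.test() in its default asymptotic mode:

```r
femsmoke <- data.frame(
  y = c(2, 1, 3, 5, 14, 7, 27, 12, 51, 40, 29, 101, 13, 64,
        53, 61, 121, 152, 95, 114, 103, 66, 64, 81, 7, 28, 0, 0),
  smoker = rep(c("yes", "no"), 14),
  dead = rep(c("yes", "no"), each = 14),
  age = rep(rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
                each = 2), 2)
)
ct3 <- xtabs(y ~ smoker + dead + age, data = femsmoke)
y11 <- ct3[1, 1, ]                   # corner cell in each stratum
r1  <- apply(ct3, 3, rowSums)[1, ]   # row-1 totals per stratum
c1  <- apply(ct3, 3, colSums)[1, ]   # column-1 totals per stratum
nk  <- apply(ct3, 3, sum)            # stratum sizes
e   <- r1 * c1 / nk                  # E(y11k) under independence
v   <- r1 * (nk - r1) * c1 * (nk - c1) / (nk^2 * (nk - 1))
(abs(sum(y11 - e)) - 0.5)^2 / sum(v) # CMH statistic
```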
3.1 Mutual independence model
Under mutual independence, \[
p_{ijk} = p_i p_j p_k
\] so \[
\mathbb{E} Y_{ijk} = n p_{ijk}
\] and \[
\log \mathbb{E} Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k.
So the main-effects-only model corresponds to the mutual independence model.
We can test independence by the Pearson \(X^2\) test
summary(ct3)
Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
Number of cases in table: 1314
Number of factors: 3
Test for independence of all factors:
Chisq = 790.6, df = 19, p-value = 2.14e-155
or by a Poisson GLM
modi <- glm(y ~ smoker + dead + age, family = poisson, data = femsmoke)
summary(modi)
Call:
glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.67778 0.10702 25.021 < 2e-16 ***
smokerno 0.22931 0.05554 4.129 3.64e-05 ***
deadno 0.94039 0.06139 15.319 < 2e-16 ***
age25-34 0.87618 0.11003 7.963 1.67e-15 ***
age35-44 0.67591 0.11356 5.952 2.65e-09 ***
age45-54 0.57536 0.11556 4.979 6.40e-07 ***
age55-64 0.70166 0.11307 6.206 5.45e-10 ***
age65-74 0.34377 0.12086 2.844 0.00445 **
age75+ -0.41837 0.14674 -2.851 0.00436 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1193.9 on 27 degrees of freedom
Residual deviance: 735.0 on 19 degrees of freedom
AIC: 887.2
Number of Fisher Scoring iterations: 6
The exponentiated coefficients are just the marginal proportions. For example, \(\exp(0.22931) \approx 1.26\) for smokerno, so the smoker proportions are
c(1, 1.26) / (1+1.26)
[1] 0.4424779 0.5575221
prop.table(xtabs(y ~ smoker, femsmoke))
smoker
yes no
0.4429224 0.5570776
3.2 Joint independence
Suppose the first two variables are dependent but jointly independent of the third. Then \[
p_{ijk} = p_{ij} p_k
\] and \[
\log \mathbb{E} Y_{ijk} = \log n + \log p_{ij} + \log p_k.
\]
It leads to a log-linear model with just one interaction
glm(y ~ smoker * dead + age, family = poisson, data = femsmoke) %>%
  summary()
Call:
glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.51582 0.12239 20.555 < 2e-16 ***
smokerno 0.50361 0.10743 4.688 2.76e-06 ***
deadno 1.15910 0.09722 11.922 < 2e-16 ***
age25-34 0.87618 0.11003 7.963 1.67e-15 ***
age35-44 0.67591 0.11356 5.952 2.65e-09 ***
age45-54 0.57536 0.11556 4.979 6.40e-07 ***
age55-64 0.70166 0.11307 6.206 5.45e-10 ***
age65-74 0.34377 0.12086 2.844 0.00445 **
age75+ -0.41837 0.14674 -2.851 0.00436 **
smokerno:deadno -0.37858 0.12566 -3.013 0.00259 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1193.9 on 27 degrees of freedom
Residual deviance: 725.8 on 18 degrees of freedom
AIC: 880
Number of Fisher Scoring iterations: 6
It improves the fit (the residual deviance drops from 735.0 to 725.8), but still not enough.
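A quick goodness-of-fit check makes "still not enough" concrete, comparing the residual deviance to its asymptotic \(\chi^2\) reference:

```r
# Joint independence fit: residual deviance 725.8 on 18 df
pchisq(725.8, df = 18, lower.tail = FALSE)  # essentially zero
```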
3.3 Conditional independence
Let \(p_{ij \mid k}\) be the probability that an observation falls in the \((i, j)\) cell given that the third variable takes value \(k\). Suppose the first two variables are independent given the value of the third. Then \[
p_{ijk} = p_{ij\mid k} p_k = p_{i\mid k} p_{j \mid k} p_k = \frac{p_{ik} p_{jk}}{p_k}.
\] and \[
\log \mathbb{E} Y_{ijk} = \log n + \log p_{ik} + \log p_{jk} - \log p_k.
\]
It leads to a model with two interaction terms
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
  summary()
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
  drop1(test = "Chi")
Single term deletions
Model:
y ~ smoker * age + dead * age
Df Deviance AIC LRT Pr(>Chi)
<none> 8.33 184.52
smoker:age 6 101.83 266.03 93.51 < 2.2e-16 ***
age:dead 6 641.50 805.69 633.17 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3.4 Uniform association
Consider the model with all two-way interactions \[
\log \mathbb{E}Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k + \log p_{ij} + \log p_{ik} + \log p_{jk}.
\]
This model has no simple interpretation in terms of independence.
modu <- glm(y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)
summary(modu)
The interaction term smoker:dead is weakly significant in this model. This corresponds to a test of conditional independence between smoker and dead given age group. The p-value is very similar to that from the CMH test.
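That test can be read off a drop1() analysis of deviance on the all-two-way-interactions model; this sketch re-enters the femsmoke counts so it is self-contained:

```r
femsmoke <- data.frame(
  y = c(2, 1, 3, 5, 14, 7, 27, 12, 51, 40, 29, 101, 13, 64,
        53, 61, 121, 152, 95, 114, 103, 66, 64, 81, 7, 28, 0, 0),
  smoker = rep(c("yes", "no"), 14),
  dead = rep(c("yes", "no"), each = 14),
  age = rep(rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
                each = 2), 2)
)
modu <- glm(y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)
drop1(modu, test = "Chisq")  # the smoker:dead row tests conditional independence
```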
3.6 Binomial
It also makes intuitive sense to model life status as the outcome, with smoker and age as predictors.
# glm(dead ~ smoker * age, family = binomial, weights = y, data = femsmoke) %>%
#   summary()
glm(matrix(femsmoke$y, ncol = 2) ~ smoker * age,
    family = binomial, data = femsmoke[1:14, ]) %>%
  step(test = "Chi") %>%
  summary()