Display system information and load tidyverse and faraway packages
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.4.1 fastmap_1.2.0 cli_3.6.4
[5] tools_4.4.1 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.49 jsonlite_1.9.1 xfun_0.51
[13] digest_0.6.37 rlang_1.1.5 evaluate_1.0.3
library(tidyverse)
library(faraway)
The faraway package contains the datasets used in the ELMR book.
1 Multinomial logit model
Multinomial regression is a natural extension of the binomial regression to the case when responses are multivariate counts. Let \(Y_{ij}\) be the number of observations falling into category \(j\) for batch \(i\) and let \(n_i = \sum_j Y_{ij}\) be the batch size. Then \[
\mathbb{P}(Y_{i1} = y_{i1}, \ldots, Y_{iJ} = y_{iJ}) = \binom{n_i}{y_{i1} \cdots y_{iJ}} p_{i1}^{y_{i1}} \cdots p_{iJ}^{y_{iJ}}.
\]
We distinguish between nominal multinomial data, where there is no natural order to the categories, and ordinal multinomial data, where there is an order. The multinomial logit model is intended for nominal data. It can be used for ordinal data, but the information about the order will not be used.
We use the same idea as the logistic regression. \[
\eta_{ij} = \mathbf{x}_i^T \boldsymbol{\beta}_j = \log \frac{p_{ij}}{p_{i1}}, \quad j = 2,\ldots,J.
\] We need to obey the constraint that \(\sum_{j=1}^J p_{ij} = 1\), so it is convenient to declare one of the categories, say \(j=1\), as the baseline. Then we set \(p_{i1} = 1 - \sum_{j=2}^J p_{ij}\) and have \[
p_{ij} = \frac{\exp(\eta_{ij})}{1 + \sum_{k=2}^J \exp(\eta_{ik})}.
\]
Suppose the length of \(\mathbf{x}_i\) is \(p\); then the multinomial logit model consumes \(p(J-1)\) parameters.
We may estimate the parameters by maximum likelihood and use the standard methods for inference.
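As a quick sanity check of the baseline-category formula, here is a small R sketch; the linear predictors `eta` are made-up numbers, not fitted values:

```r
# Made-up linear predictors eta_j for categories j = 2, ..., J relative to baseline j = 1
eta <- c(0.5, -1.2, 0.3)

# Baseline-category logit: p_1 = 1 / (1 + sum(exp(eta))), p_j = exp(eta_j) / (1 + sum(exp(eta)))
p <- c(1, exp(eta)) / (1 + sum(exp(eta)))
p
sum(p)  # the probabilities sum to 1
```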
1.1 1996 election data
Consider an example drawn from the 1996 American National Election Study. We will study how the party identification of a respondent (Democrat, Independent, or Republican) depends on the predictors age, education level, and income.
(nes96 <- as_tibble(nes96))
# A tibble: 944 × 10
popul TVnews selfLR ClinLR DoleLR PID age educ income vote
<int> <int> <ord> <ord> <ord> <ord> <int> <ord> <ord> <fct>
1 0 7 extCon extLib Con strRep 36 HS $3Kminus Dole
2 190 1 sliLib sliLib sliCon weakDem 20 Coll $3Kminus Clinton
3 31 7 Lib Lib Con weakDem 24 BAdeg $3Kminus Clinton
4 83 4 sliLib Mod sliCon weakDem 28 BAdeg $3Kminus Clinton
5 640 7 sliCon Con Mod strDem 68 BAdeg $3Kminus Clinton
6 110 3 sliLib Mod Con weakDem 21 Coll $3Kminus Clinton
7 100 7 sliCon Con Mod weakDem 77 Coll $3Kminus Clinton
8 31 1 sliCon Mod sliCon indRep 21 Coll $3Kminus Clinton
9 180 7 Mod Con sliLib indind 31 Coll $3Kminus Clinton
10 2800 0 sliLib sliLib extCon strDem 39 HS $3Kminus Clinton
# ℹ 934 more rows
Numerical summaries:
summary(nes96)
popul TVnews selfLR ClinLR DoleLR
Min. : 0.0 Min. :0.000 extLib: 16 extLib:109 extLib: 13
1st Qu.: 1.0 1st Qu.:1.000 Lib :103 Lib :317 Lib : 31
Median : 22.0 Median :3.000 sliLib:147 sliLib:236 sliLib: 43
Mean : 306.4 Mean :3.728 Mod :256 Mod :160 Mod : 87
3rd Qu.: 110.0 3rd Qu.:7.000 sliCon:170 sliCon: 67 sliCon:195
Max. :7300.0 Max. :7.000 Con :218 Con : 36 Con :460
extCon: 34 extCon: 19 extCon:115
PID age educ income vote
strDem :200 Min. :19.00 MS : 13 $60K-$75K:103 Clinton:551
weakDem:180 1st Qu.:34.00 HSdrop: 52 $50K-$60K:100 Dole :393
indDem :108 Median :44.00 HS :248 $30K-$35K: 70
indind : 37 Mean :47.04 Coll :187 $25K-$30K: 68
indRep : 94 3rd Qu.:58.00 CCdeg : 90 $105Kplus: 68
weakRep:150 Max. :91.00 BAdeg :227 $35K-$40K: 62
strRep :175 MAdeg :127 (Other) :473
The original variable PID (party identification) has ordered categories strDem (strong Democrat), weakDem (weak Democrat), indDem (independent Democrat), indind (independent), indRep (independent Republican), weakRep (weak Republican), and strRep (strong Republican). In this analysis we only consider the coarsened categories Democrat, Independent, and Republican.
The income variable in the original data was an ordered factor with income ranges. We convert this to a numeric variable by taking the midpoint of each range.
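Neither sPID nor nincome appears in the raw data, so here is one way the two derived variables could be constructed. This is a sketch: the grouping of the PID levels follows the description above, and the income midpoints are left as a placeholder to be filled in from the codebook.

```r
# Collapse the 7-level PID into three categories (level names taken from the summary above)
nes96 <- nes96 %>%
  mutate(sPID = fct_collapse(PID,
    Democrat    = c("strDem", "weakDem"),
    Independent = c("indDem", "indind", "indRep"),
    Republican  = c("weakRep", "strRep")
  ))

# Replace each income range by its midpoint (inca would hold one midpoint per level
# of income, in level order -- the actual values must come from the codebook):
# nes96 <- nes96 %>% mutate(nincome = inca[unclass(income)])
```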
To verify that the trends observed in these graphs are statistically significant, we need a formal model. The multinom function is available in the nnet package.
library(nnet)
mmod <- multinom(sPID ~ age + educ + nincome, data = nes96)
# weights: 30 (18 variable)
initial value 1037.090001
iter 10 value 990.568608
iter 20 value 984.319052
final value 984.166272
converged
We can select which variables to include in the model based on the AIC criterion, using a stepwise search.
step(mmod)
Start: AIC=2004.33
sPID ~ age + educ + nincome
trying - age
# weights: 27 (16 variable)
initial value 1037.090001
iter 10 value 988.896864
iter 20 value 985.822223
final value 985.812737
converged
trying - educ
# weights: 12 (6 variable)
initial value 1037.090001
iter 10 value 992.269502
final value 992.269484
converged
trying - nincome
# weights: 27 (16 variable)
initial value 1037.090001
iter 10 value 1009.025560
iter 20 value 1006.961593
final value 1006.955275
converged
Df AIC
- educ 6 1996.539
- age 16 2003.625
<none> 18 2004.333
- nincome 16 2045.911
# weights: 12 (6 variable)
initial value 1037.090001
iter 10 value 992.269502
final value 992.269484
converged
Step: AIC=1996.54
sPID ~ age + nincome
trying - age
# weights: 9 (4 variable)
initial value 1037.090001
final value 992.712152
converged
trying - nincome
# weights: 9 (4 variable)
initial value 1037.090001
final value 1020.425203
converged
Df AIC
- age 4 1993.424
<none> 6 1996.539
- nincome 4 2048.850
# weights: 9 (4 variable)
initial value 1037.090001
final value 992.712152
converged
Step: AIC=1993.42
sPID ~ nincome
trying - nincome
# weights: 6 (2 variable)
initial value 1037.090001
final value 1020.636052
converged
Df AIC
<none> 4 1993.424
- nincome 2 2045.272
Or we can use standard likelihood methods to derive a test comparing nested models. For example, we can fit a model with nincome as the only predictor and then compare the deviances.
mmodi <- multinom(sPID ~ nincome, data = nes96)
# weights: 9 (4 variable)
initial value 1037.090001
final value 992.712152
converged
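The deviance comparison itself can be carried out as a chi-squared test. A sketch, using the edf component that multinom stores as its effective number of parameters:

```r
# Likelihood-ratio (deviance) test of the income-only model against the full model
d  <- deviance(mmodi) - deviance(mmod)   # difference in deviances
df <- mmod$edf - mmodi$edf               # difference in numbers of parameters
pchisq(d, df, lower.tail = FALSE)        # p-value for H0: the reduced model is adequate
```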
The intercept terms model the probabilities of party identification when income is zero.
cc <- c(0, -1.17493, -0.95036)
exp(cc) / sum(exp(cc))
[1] 0.5898166 0.1821593 0.2280242
The slope terms represent the change in the log-odds of being Independent rather than Democrat and of being Republican rather than Democrat, respectively, for each $1,000 increase in income.
(pp <- predict(mmodi, data.frame(nincome = c(0, 1)), type = "probs"))
2 Hierarchical responses
The cns data from the faraway package record counts of births in South Wales classified as NoCNS (no central nervous system malformation) or as one of three CNS malformation types, An (anencephalus), Sp (spina bifida), and Other, together with the local water hardness (Water) and the type of work of the parents (Work).
We might consider a multinomial response with four categories. However, we can see that most births suffer no malformation and so this category dominates the other three. It is better to consider this as a hierarchical response. Consider the multinomial likelihood for the \(i\)-th observation which is proportional to: \[
p_{i1}^{y_{i1}} p_{i2}^{y_{i2}} p_{i3}^{y_{i3}} p_{i4}^{y_{i4}}
\] Define \(p_{ic} = p_{i2} + p_{i3} + p_{i4}\) which is the probability of a birth with some kind of CNS malformation. We can then write the likelihood as \[
p_{i1}^{y_{i1}} p_{ic}^{y_{i2} + y_{i3} + y_{i4}} \times \left( \frac{p_{i2}}{p_{ic}} \right)^{y_{i2}} \left( \frac{p_{i3}}{p_{ic}} \right)^{y_{i3}} \left( \frac{p_{i4}}{p_{ic}} \right)^{y_{i4}}
\] We can now separately develop a binomial model for whether malformation occurs and a multinomial model for the type of malformation.
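A quick numeric check of this factorization, with made-up probabilities and counts for a single observation:

```r
# Hypothetical category probabilities and counts (p_1, ..., p_4 sum to 1)
p <- c(0.97, 0.010, 0.015, 0.005)
y <- c(95, 1, 3, 1)
pc <- sum(p[2:4])  # probability of any malformation

lhs <- prod(p^y)                                                # original multinomial kernel
rhs <- p[1]^y[1] * pc^sum(y[2:4]) * prod((p[2:4] / pc)^y[2:4])  # factored form
all.equal(lhs, rhs)  # TRUE
```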
cns %>%
  mutate(CNS = An + Sp + Other) %>%
  ggplot() +
  geom_point(mapping = aes(x = Water, y = log(CNS / NoCNS), color = Work), size = 5) +
  labs(x = "Water", y = "log(CNS/NoCNS)")
We observe that the proportion of CNS births falls with increasing water hardness and is higher for manual workers.
cns <- cns %>% mutate(CNS = An + Sp + Other)
binmodw <- glm(cbind(CNS, NoCNS) ~ Water + Work, cns, family = binomial)
binmoda <- glm(cbind(CNS, NoCNS) ~ Water, cns, family = binomial)
anova(binmodw, binmoda, test = "Chi")
Analysis of Deviance Table
Model 1: cbind(CNS, NoCNS) ~ Water + Work
Model 2: cbind(CNS, NoCNS) ~ Water
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 13 12.363
2 14 25.359 -1 -12.996 0.0003122 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# halfnorm(hatvalues(binmodw))
# cns %>% slice(10)
sumary(binmodw)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.4325803 0.0897889 -49.3667 < 2.2e-16
Water -0.0032644 0.0009684 -3.3709 0.0007492
WorkNonManual -0.3390577 0.0970943 -3.4920 0.0004793
n = 16 p = 3
Deviance = 12.36281 Null Deviance = 41.04749 (Difference = 28.68468)
The residual deviance is close to the degrees of freedom, indicating a reasonable fit to the data. Since exp(-0.339058) = 0.712, births to nonmanual workers have about 29% lower odds of CNS malformation. Water hardness ranges from about 40 to 160, so a difference of 120 decreases the odds of CNS malformation by about 32% (i.e., 1 - exp(-0.0032644 * 120)).
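The two interpretation numbers quoted above come directly from the fitted coefficients:

```r
exp(-0.3390577)            # odds ratio, nonmanual vs manual work: about 0.71
1 - exp(-0.0032644 * 120)  # proportional drop in odds over the observed Water range: about 0.32
```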
Now consider a multinomial model for the three malformation types conditional on a malformation having occurred.
cmmod <- multinom(cbind(An, Sp, Other) ~ Water + Work, cns)
# weights: 12 (6 variable)
initial value 762.436928
iter 10 value 685.762336
final value 685.762238
converged
We find that neither predictor has much effect:
nmod <- step(cmmod)
Start: AIC=1383.52
cbind(An, Sp, Other) ~ Water + Work
trying - Water
# weights: 9 (4 variable)
initial value 762.436928
iter 10 value 686.562074
final value 686.562063
converged
trying - Work
# weights: 9 (4 variable)
initial value 762.436928
final value 686.580556
converged
Df AIC
- Water 4 1381.124
- Work 4 1381.161
<none> 6 1383.524
# weights: 9 (4 variable)
initial value 762.436928
iter 10 value 686.562074
final value 686.562063
converged
Step: AIC=1381.12
cbind(An, Sp, Other) ~ Work
trying - Work
# weights: 6 (2 variable)
initial value 762.436928
final value 687.227416
converged
Df AIC
- Work 2 1378.455
<none> 4 1381.124
# weights: 6 (2 variable)
initial value 762.436928
final value 687.227416
converged
Step: AIC=1378.45
cbind(An, Sp, Other) ~ 1
cc <- c(0, 0.28963, -0.98083)
names(cc) <- c("An", "Sp", "Other")
exp(cc) / sum(exp(cc))
An Sp Other
0.3688767 0.4927946 0.1383287
So we find that water hardness and parents’ professions are related to the probability of a malformed birth, but that they have no effect on the type of malformation.
If we fit a multinomial logit model to all four categories:
mmmod <- multinom(cbind(NoCNS, An, Sp, Other) ~ Water + Work, cns)
# weights: 16 (9 variable)
initial value 117209.801938
iter 10 value 27783.394982
iter 20 value 20770.391265
iter 30 value 11403.855681
iter 40 value 5238.502151
iter 50 value 4717.065878
final value 4695.482597
converged
(Intercept) Water WorkNonManual
An 0 0.066695285 0.02338312
Sp 0 0.001835095 0.07077816
Other 0 0.840643463 0.02348335
We find that both Water and Work are significant, but the fact that they do not distinguish between the types of malformation is not easily discovered from this model.
3 Ordinal multinomial responses
Suppose we have \(J\) ordered categories and \(p_{ij} = \mathbb{P}(Y_i = j)\) for \(j=1, \ldots, J\). For ordered responses, it is more convenient to work with the cumulative probabilities \[
\gamma_{ij} = \mathbb{P}(Y_i \le j).
\] Since \(\gamma_{iJ}=1\), we only need to model \(J-1\) cumulative probabilities.
To link the \(\gamma\)s to the covariates \(\mathbf{x}\), we consider \[
g(\gamma_{ij}) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta},
\] where \(\theta_j\) have to be non-decreasing in \(j\) to honor the ordering. In other words, \(\boldsymbol{\beta}\) has a uniform effect on the response categories and each category has its own intercept. Again we have at least 3 choices for the link function: logit, probit or cloglog.
Latent variable interpretation.
Suppose that \(Z_i\) is some unobserved continuous variable that might be thought of as the real underlying latent response. We only observe a discretized version of \(Z_i\) in the form of \(Y_i\), where \(Y_i = j\) is observed if \(\theta_{j-1} < Z_i \le \theta_j\). Further suppose that \(Z_i - \boldsymbol{\beta}^T \mathbf{x}_i\) has distribution \(F\); then \[
\gamma_{ij} = P(Y_i\leq j) = P(Z_i\leq \theta_j) = P(Z_i-\beta^Tx_i \leq \theta_j -\beta^Tx_i) = F(\theta_j-\beta^Tx_i)
\]
Now if, for example, \(F\) follows the logistic distribution, where \(F(x) = e^x/(1+e^x)\), then:
\[
\gamma_{ij} = \frac{\text{exp}(\theta_j-\beta^Tx_i)}{1+\text{exp}(\theta_j-\beta^Tx_i)}
\] and so we would have a logit model for the cumulative probabilities \(\gamma_{ij}\). Choosing the normal distribution for the latent variable leads to a probit model, while the choice of an extreme value distribution leads to the complementary log-log. This latent variable explanation for the model is displayed as the following figure.
theta <- c(-2, -1, 2)
tibble(t = seq(-6, 6, 0.1), pdf = dlogis(t, location = 0, scale = 1)) %>%
  ggplot() +
  geom_line(mapping = aes(x = t, y = pdf)) +
  annotate("segment", x = theta, y = 0, xend = theta, yend = dlogis(theta)) +
  annotate("text", x = theta, y = 0, label = c("theta[1]", "theta[2]", "theta[3]"), parse = TRUE) +
  labs(x = "t", y = "Density")
3.1 Proportional odds model
The logit link function dictates \[
\log \frac{\gamma_{ij}}{1 - \gamma_{ij}} = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \quad j = 1,\ldots,J-1.
\]
The corresponding inverse link function is \[
\gamma_{ij} = \mathbb{P}(Y_i \le j)= \frac{\exp (\theta_j - \mathbf{x}_i^T \boldsymbol{\beta})}{1 + \exp (\theta_j - \mathbf{x}_i^T \boldsymbol{\beta})}.
\]
When a coefficient \(\beta_k > 0\), as the corresponding predictor \(x_{ik}\) increases, every cumulative probability \(\gamma_{ij} = \mathbb{P}(Y_i \le j)\), \(j < J\), decreases by the same amount on the log-odds scale, so \(Y_i\) is more likely to take a larger value. This motivates the minus sign in the definition of the model, since it allows an easier interpretation of \(\boldsymbol{\beta}\).
It is called the proportional odds model because the relative odds for \(y \le j\) comparing \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are \[
\left( \frac{\gamma_j(\mathbf{x}_1)}{1 - \gamma_j(\mathbf{x}_1)} \right) / \left( \frac{\gamma_j(\mathbf{x}_2)}{1 - \gamma_j(\mathbf{x}_2)} \right) = \exp (- (\mathbf{x}_1 - \mathbf{x}_2)^T \boldsymbol{\beta}),
\] which does not depend on \(j\). Here \(\gamma_j(\mathbf{x}_i) = \gamma_{ij}\).
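We can check this property numerically with the fitted values reported below for pomod (intercepts 0.6449 and 1.7374, nincome coefficient 0.012739); holding the other predictors at zero is a simplifying assumption made only for this illustration:

```r
theta <- c(0.6449, 1.7374)  # fitted intercepts theta_1, theta_2
beta  <- 0.012739           # fitted nincome coefficient
gam   <- function(x) plogis(theta - x * beta)  # cumulative probabilities at income x

odds <- function(g) g / (1 - g)
odds(gam(10)) / odds(gam(60))  # the same relative odds for both j, and equal to:
exp(-(10 - 60) * beta)
```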
We can fit a proportional odds model using the polr function from the MASS package.
library(MASS)
pomod <- polr(sPID ~ age + educ + nincome, data = nes96)
summary(pomod)
Call:
polr(formula = sPID ~ age + educ + nincome, data = nes96)
Coefficients:
Value Std. Error t value
age 0.005775 0.003887 1.48581
educ.L 0.724087 0.384388 1.88374
educ.Q -0.781361 0.351173 -2.22500
educ.C 0.040168 0.291762 0.13767
educ^4 -0.019925 0.232429 -0.08573
educ^5 -0.079413 0.191533 -0.41462
educ^6 -0.061104 0.157747 -0.38735
nincome 0.012739 0.002140 5.95187
Intercepts:
Value Std. Error t value
Democrat|Independent 0.6449 0.2435 2.6479
Independent|Republican 1.7374 0.2493 6.9694
Residual Deviance: 1984.211
AIC: 2004.211
Stepwise regression leads to a model with nincome as the only predictor.
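A sketch of the reduced fit (the name pomodi is our choice, not from the source):

```r
pomodi <- polr(sPID ~ nincome, data = nes96)
c(full = AIC(pomod), reduced = AIC(pomodi))
```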
The probability of being a Democrat is given by the area lying to the left of the leftmost of each pair of lines, while the probability of being a Republican is given by the area to the right of the rightmost of the pair. Independents are represented by the area in-between.
3.2 Ordered probit model
If we use the probit link, then \[
\Phi^{-1}(\gamma_j(\mathbf{x}_i)) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \quad j=1,\ldots,J-1.
\]
opmod <- polr(sPID ~ nincome, method = "probit", data = nes96)
summary(opmod)
Call:
polr(formula = sPID ~ nincome, data = nes96, method = "probit")
Coefficients:
Value Std. Error t value
nincome 0.008182 0.001208 6.775
Intercepts:
Value Std. Error t value
Democrat|Independent 0.1284 0.0694 1.8510
Independent|Republican 0.7976 0.0722 11.0399
Residual Deviance: 1994.892
AIC: 2000.892
The deviance is similar to that of the logit link, but the coefficients appear different: the probit and logit links place the latent variable on different scales, so their coefficients are not directly comparable. The predictions are similar.
il <- c(8, 26, 42, 58, 74, 90, 107)
predict(opmod, data.frame(nincome = il), type = "probs")
3.3 Proportional hazards model
Suppose we use the cloglog link \[
\log (- \log (1 - \gamma_j(\mathbf{x}_i))) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}.
\]
The hazard of category \(j\) is the probability of falling in category \(j\) given that the response is at least \(j\): \[
\text{Hazard}(j) = \mathbb{P}(Y_i = j \mid Y_i \ge j) = \frac{\mathbb{P}(Y_i = j)}{\mathbb{P}(Y_i \ge j)} = \frac{\gamma_{ij} - \gamma_{i,j-1}}{1 - \gamma_{i,j-1}}.
\] The quantity \(- \log \mathbb{P}(Y > j)\) is called the cumulative hazard function.
Since \[\begin{eqnarray*}
1 - \gamma_j(\mathbf{x}) &=& \mathbb{P}(Y > j) = e^{- e^{\theta_j - \mathbf{x}^T \boldsymbol{\beta}}},
\end{eqnarray*}\] we have \[\begin{eqnarray*}
\log \mathbb{P}(Y > j) = - e^{\theta_j - \mathbf{x}^T \boldsymbol{\beta}}
\end{eqnarray*}\] and \[
\frac{- \log \mathbb{P}(Y > j \mid \mathbf{x}_1)}{- \log \mathbb{P}(Y > j \mid \mathbf{x}_2)} = e^{(\mathbf{x}_2 - \mathbf{x}_1)^T \boldsymbol{\beta}}
\] or \[
\mathbb{P}(Y > j \mid \mathbf{x}_1) = [\mathbb{P}(Y > j \mid \mathbf{x}_2)]^{\exp((\mathbf{x}_2 - \mathbf{x}_1)^T \boldsymbol{\beta})}.
\] It is called the proportional hazards model because the ratio of cumulative hazards does not depend on level \(j\).
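A small numeric check that the cumulative-hazard ratio is free of \(j\); the cutpoints theta and the coefficient beta here are made-up values:

```r
theta <- c(0.2, 0.9)                    # made-up cutpoints theta_1, theta_2
beta  <- 0.01                           # made-up coefficient
H <- function(x) exp(theta - x * beta)  # cumulative hazard -log P(Y > j | x)
H(0) / H(50)                            # the same ratio for every j: exp((50 - 0) * beta)
```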
polr(sPID ~ nincome, method = "cloglog", data = nes96)