Display system information and load tidyverse and faraway packages
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.4.1 fastmap_1.2.0 cli_3.6.4
[5] tools_4.4.1 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.49 jsonlite_1.9.1 xfun_0.51
[13] digest_0.6.37 rlang_1.1.5 evaluate_1.0.3
library(tidyverse)
library(faraway)
The faraway package contains the datasets used in the ELMR book.
1 Multinomial logit model
Multinomial regression is a natural extension of the binomial regression to the case when responses are multivariate counts. Let \(Y_{ij}\) be the number of observations falling into category \(j\) for batch \(i\) and let \(n_i = \sum_j Y_{ij}\) be the batch size. Then \[
\mathbb{P}(Y_{i1} = y_{i1}, \ldots, Y_{iJ} = y_{iJ}) = \binom{n_i}{y_{i1} \cdots y_{iJ}} p_{i1}^{y_{i1}} \cdots p_{iJ}^{y_{iJ}}.
\]
We distinguish between nominal multinomial data, where there is no natural order to the categories, and ordinal multinomial data, where there is an order. The multinomial logit model is intended for nominal data. It can be used for ordinal data, but the information about the order will not be used.
We use the same idea as the logistic regression. \[
\eta_{ij} = \mathbf{x}_i^T \boldsymbol{\beta}_j = \log \frac{p_{ij}}{p_{i1}}, \quad j = 2,\ldots,J.
\] We need to obey the constraint that \(\sum_{j=1}^J p_{ij} = 1\), so it is convenient to declare one of the categories, say \(j=1\), as the baseline. Then we set \(p_{i1} = 1 - \sum_{j=2}^J p_{ij}\) and have \[
p_{ij} = \frac{\exp(\eta_{ij})}{1 + \sum_{k=2}^J \exp(\eta_{ik})}.
\]
Suppose the length of \(\mathbf{x}_i\) is \(p\); then the multinomial logit model consumes \(p(J-1)\) parameters.
We may estimate the parameters by maximum likelihood and use the standard methods for inference.
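As a quick sanity check of the baseline-category formula, here is a small R sketch; the linear predictors `eta` are made-up numbers, not fitted values:

```r
# Made-up linear predictors eta_j for categories j = 2, ..., J relative to baseline j = 1
eta <- c(0.5, -1.2, 0.3)

# Baseline-category logit: p_1 = 1 / (1 + sum(exp(eta))), p_j = exp(eta_j) / (1 + sum(exp(eta)))
p <- c(1, exp(eta)) / (1 + sum(exp(eta)))
p
sum(p)  # the probabilities sum to 1
```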
1.1 1996 election data
Consider an example drawn from the 1996 American National Election Study. We will study how the party identification of a respondent (Democrat, Independent, or Republican) depends on the predictors age, education level, and income.
(nes96 <- as_tibble(nes96))
# A tibble: 944 × 10
popul TVnews selfLR ClinLR DoleLR PID age educ income vote
<int> <int> <ord> <ord> <ord> <ord> <int> <ord> <ord> <fct>
1 0 7 extCon extLib Con strRep 36 HS $3Kminus Dole
2 190 1 sliLib sliLib sliCon weakDem 20 Coll $3Kminus Clinton
3 31 7 Lib Lib Con weakDem 24 BAdeg $3Kminus Clinton
4 83 4 sliLib Mod sliCon weakDem 28 BAdeg $3Kminus Clinton
5 640 7 sliCon Con Mod strDem 68 BAdeg $3Kminus Clinton
6 110 3 sliLib Mod Con weakDem 21 Coll $3Kminus Clinton
7 100 7 sliCon Con Mod weakDem 77 Coll $3Kminus Clinton
8 31 1 sliCon Mod sliCon indRep 21 Coll $3Kminus Clinton
9 180 7 Mod Con sliLib indind 31 Coll $3Kminus Clinton
10 2800 0 sliLib sliLib extCon strDem 39 HS $3Kminus Clinton
# ℹ 934 more rows
Numerical summaries:
summary(nes96)
popul TVnews selfLR ClinLR DoleLR
Min. : 0.0 Min. :0.000 extLib: 16 extLib:109 extLib: 13
1st Qu.: 1.0 1st Qu.:1.000 Lib :103 Lib :317 Lib : 31
Median : 22.0 Median :3.000 sliLib:147 sliLib:236 sliLib: 43
Mean : 306.4 Mean :3.728 Mod :256 Mod :160 Mod : 87
3rd Qu.: 110.0 3rd Qu.:7.000 sliCon:170 sliCon: 67 sliCon:195
Max. :7300.0 Max. :7.000 Con :218 Con : 36 Con :460
extCon: 34 extCon: 19 extCon:115
PID age educ income vote
strDem :200 Min. :19.00 MS : 13 $60K-$75K:103 Clinton:551
weakDem:180 1st Qu.:34.00 HSdrop: 52 $50K-$60K:100 Dole :393
indDem :108 Median :44.00 HS :248 $30K-$35K: 70
indind : 37 Mean :47.04 Coll :187 $25K-$30K: 68
indRep : 94 3rd Qu.:58.00 CCdeg : 90 $105Kplus: 68
weakRep:150 Max. :91.00 BAdeg :227 $35K-$40K: 62
strRep :175 MAdeg :127 (Other) :473
The original variable PID (party identification) has ordered categories strDem (strong Democrat), weakDem (weak Democrat), indDem (independent Democrat), indind (independent), indRep (independent Republican), weakRep (weak Republican), and strRep (strong Republican). In this analysis we only consider the coarsened categories Democrat, Independent, and Republican.
The income variable in the original data was an ordered factor with income ranges. We convert this to a numeric variable by taking the midpoint of each range.
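Neither sPID nor nincome appears in the raw data, so here is one way the two derived variables could be constructed. This is a sketch: the grouping of the PID levels follows the description above, and the income midpoints are left as a placeholder to be filled in from the codebook.

```r
# Collapse the 7-level PID into three categories (level names taken from the summary above)
nes96 <- nes96 %>%
  mutate(sPID = fct_collapse(PID,
    Democrat    = c("strDem", "weakDem"),
    Independent = c("indDem", "indind", "indRep"),
    Republican  = c("weakRep", "strRep")
  ))

# Replace each income range by its midpoint (inca would hold one midpoint per level
# of income, in level order -- the actual values must come from the codebook):
# nes96 <- nes96 %>% mutate(nincome = inca[unclass(income)])
```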
To verify that the trends observed in these graphs are statistically significant, we need a formal model. The multinom function is available in the nnet package.
library(nnet)
mmod <- multinom(sPID ~ age + educ + nincome, data = nes96)
# weights: 30 (18 variable)
initial value 1037.090001
iter 10 value 990.568608
iter 20 value 984.319052
final value 984.166272
converged
We can select which variables to include in the model based on the AIC criterion, using a stepwise search.
step(mmod)
Start: AIC=2004.33
sPID ~ age + educ + nincome
trying - age
# weights: 27 (16 variable)
initial value 1037.090001
iter 10 value 988.896864
iter 20 value 985.822223
final value 985.812737
converged
trying - educ
# weights: 12 (6 variable)
initial value 1037.090001
iter 10 value 992.269502
final value 992.269484
converged
trying - nincome
# weights: 27 (16 variable)
initial value 1037.090001
iter 10 value 1009.025560
iter 20 value 1006.961593
final value 1006.955275
converged
Df AIC
- educ 6 1996.539
- age 16 2003.625
<none> 18 2004.333
- nincome 16 2045.911
# weights: 12 (6 variable)
initial value 1037.090001
iter 10 value 992.269502
final value 992.269484
converged
Step: AIC=1996.54
sPID ~ age + nincome
trying - age
# weights: 9 (4 variable)
initial value 1037.090001
final value 992.712152
converged
trying - nincome
# weights: 9 (4 variable)
initial value 1037.090001
final value 1020.425203
converged
Df AIC
- age 4 1993.424
<none> 6 1996.539
- nincome 4 2048.850
# weights: 9 (4 variable)
initial value 1037.090001
final value 992.712152
converged
Step: AIC=1993.42
sPID ~ nincome
trying - nincome
# weights: 6 (2 variable)
initial value 1037.090001
final value 1020.636052
converged
Df AIC
<none> 4 1993.424
- nincome 2 2045.272
Or we can use standard likelihood methods to derive a test comparing nested models. For example, we can fit a model with nincome as the only predictor and then compare the deviances.
mmodi <- multinom(sPID ~ nincome, data = nes96)
# weights: 9 (4 variable)
initial value 1037.090001
final value 992.712152
converged
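The deviance comparison itself can be carried out as a chi-squared test. A sketch, using the edf component that multinom stores as its effective number of parameters:

```r
# Likelihood-ratio (deviance) test of the income-only model against the full model
d  <- deviance(mmodi) - deviance(mmod)   # difference in deviances
df <- mmod$edf - mmodi$edf               # difference in numbers of parameters
pchisq(d, df, lower.tail = FALSE)        # p-value for H0: the reduced model is adequate
```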
The intercept terms model the probabilities of party identification when income is zero.
cc <- c(0, -1.17493, -0.95036)
exp(cc) / sum(exp(cc))
[1] 0.5898166 0.1821593 0.2280242
The slope terms represent the change in the log-odds of being Independent rather than Democrat and of being Republican rather than Democrat, respectively, for each $1,000 increase in income.
(pp <- predict(mmodi, data.frame(nincome = c(0, 1)), type = "probs"))
2 Hierarchical responses
The cns data from the faraway package record counts of births in South Wales classified as NoCNS (no central nervous system malformation) or as one of three CNS malformation types, An (anencephalus), Sp (spina bifida), and Other, together with the local water hardness (Water) and the type of work of the parents (Work).
We might consider a multinomial response with four categories. However, we can see that most births suffer no malformation and so this category dominates the other three. It is better to consider this as a hierarchical response. Consider the multinomial likelihood for the \(i\)-th observation which is proportional to: \[
p_{i1}^{y_{i1}} p_{i2}^{y_{i2}} p_{i3}^{y_{i3}} p_{i4}^{y_{i4}}
\] Define \(p_{ic} = p_{i2} + p_{i3} + p_{i4}\) which is the probability of a birth with some kind of CNS malformation. We can then write the likelihood as \[
p_{i1}^{y_{i1}} p_{ic}^{y_{i2} + y_{i3} + y_{i4}} \times \left( \frac{p_{i2}}{p_{ic}} \right)^{y_{i2}} \left( \frac{p_{i3}}{p_{ic}} \right)^{y_{i3}} \left( \frac{p_{i4}}{p_{ic}} \right)^{y_{i4}}
\] We can now separately develop a binomial model for whether malformation occurs and a multinomial model for the type of malformation.
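A quick numeric check of this factorization, with made-up probabilities and counts for a single observation:

```r
# Hypothetical category probabilities and counts (p_1, ..., p_4 sum to 1)
p <- c(0.97, 0.010, 0.015, 0.005)
y <- c(95, 1, 3, 1)
pc <- sum(p[2:4])  # probability of any malformation

lhs <- prod(p^y)                                                # original multinomial kernel
rhs <- p[1]^y[1] * pc^sum(y[2:4]) * prod((p[2:4] / pc)^y[2:4])  # factored form
all.equal(lhs, rhs)  # TRUE
```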
cns %>%
  mutate(CNS = An + Sp + Other) %>%
  ggplot() +
  geom_point(mapping = aes(x = Water, y = log(CNS / NoCNS), color = Work), size = 5) +
  labs(x = "Water", y = "log(CNS/NoCNS)")
We observe that the proportion of CNS births falls with increasing water hardness and is higher for manual workers.
cns <- cns %>% mutate(CNS = An + Sp + Other)
binmodw <- glm(cbind(CNS, NoCNS) ~ Water + Work, cns, family = binomial)
binmoda <- glm(cbind(CNS, NoCNS) ~ Water, cns, family = binomial)
anova(binmodw, binmoda, test = "Chi")
Analysis of Deviance Table
Model 1: cbind(CNS, NoCNS) ~ Water + Work
Model 2: cbind(CNS, NoCNS) ~ Water
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 13 12.363
2 14 25.359 -1 -12.996 0.0003122 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# halfnorm(hatvalues(binmodw))
# cns %>% slice(10)
sumary(binmodw)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.4325803 0.0897889 -49.3667 < 2.2e-16
Water -0.0032644 0.0009684 -3.3709 0.0007492
WorkNonManual -0.3390577 0.0970943 -3.4920 0.0004793
n = 16 p = 3
Deviance = 12.36281 Null Deviance = 41.04749 (Difference = 28.68468)
The residual deviance is close to the degrees of freedom, indicating a reasonable fit to the data. Since exp(-0.339058) = 0.712, births to nonmanual workers have about 29% lower odds of CNS malformation. Water hardness ranges from about 40 to 160, so a difference of 120 decreases the odds of CNS malformation by about 32% (i.e., 1 - exp(-0.0032644 * 120)).
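The two interpretation numbers quoted above come directly from the fitted coefficients:

```r
exp(-0.3390577)            # odds ratio, nonmanual vs manual work: about 0.71
1 - exp(-0.0032644 * 120)  # proportional drop in odds over the observed Water range: about 0.32
```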
Now consider a multinomial model for the three malformation types conditional on a malformation having occurred.
cmmod <- multinom(cbind(An, Sp, Other) ~ Water + Work, cns)
# weights: 12 (6 variable)
initial value 762.436928
iter 10 value 685.762336
final value 685.762238
converged
We find that neither predictor has much effect:
nmod <- step(cmmod)
Start: AIC=1383.52
cbind(An, Sp, Other) ~ Water + Work
trying - Water
# weights: 9 (4 variable)
initial value 762.436928
iter 10 value 686.562074
final value 686.562063
converged
trying - Work
# weights: 9 (4 variable)
initial value 762.436928
final value 686.580556
converged
Df AIC
- Water 4 1381.124
- Work 4 1381.161
<none> 6 1383.524
# weights: 9 (4 variable)
initial value 762.436928
iter 10 value 686.562074
final value 686.562063
converged
Step: AIC=1381.12
cbind(An, Sp, Other) ~ Work
trying - Work
# weights: 6 (2 variable)
initial value 762.436928
final value 687.227416
converged
Df AIC
- Work 2 1378.455
<none> 4 1381.124
# weights: 6 (2 variable)
initial value 762.436928
final value 687.227416
converged
Step: AIC=1378.45
cbind(An, Sp, Other) ~ 1
cc <- c(0, 0.28963, -0.98083)
names(cc) <- c("An", "Sp", "Other")
exp(cc) / sum(exp(cc))
An Sp Other
0.3688767 0.4927946 0.1383287
So we find that water hardness and parents’ professions are related to the probability of a malformed birth, but that they have no effect on the type of malformation.
If we fit a multinomial logit model to all four categories:
mmmod <- multinom(cbind(NoCNS, An, Sp, Other) ~ Water + Work, cns)
# weights: 16 (9 variable)
initial value 117209.801938
iter 10 value 27783.394982
iter 20 value 20770.391265
iter 30 value 11403.855681
iter 40 value 5238.502151
iter 50 value 4717.065878
final value 4695.482597
converged
(Intercept) Water WorkNonManual
An 0 0.066695285 0.02338312
Sp 0 0.001835095 0.07077816
Other 0 0.840643463 0.02348335
We find that both Water and Work are significant, but the fact that they do not distinguish between the types of malformation is not easily discovered from this model.
3 Ordinal multinomial responses
Suppose we have \(J\) ordered categories and \(p_{ij} = \mathbb{P}(Y_i = j)\) for \(j=1, \ldots, J\). For ordered responses, it is more convenient to work with the cumulative probabilities \[
\gamma_{ij} = \mathbb{P}(Y_i \le j).
\] Since \(\gamma_{iJ}=1\), we only need to model \(J-1\) cumulative probabilities.
To link the \(\gamma\)s to the covariates \(\mathbf{x}\), we consider \[
g(\gamma_{ij}) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta},
\] where \(\theta_j\) have to be non-decreasing in \(j\) to honor the ordering. In other words, \(\boldsymbol{\beta}\) has a uniform effect on the response categories and each category has its own intercept. Again we have at least 3 choices for the link function: logit, probit or cloglog.
Latent variable interpretation.
Suppose that \(Z_i\) is some unobserved continuous variable that might be thought of as the real underlying latent response. We only observe a discretized version of \(Z_i\) in the form of \(Y_i\), where \(Y_i = j\) is observed if \(\theta_{j-1} < Z_i \le \theta_j\). Further suppose that \(Z_i - \boldsymbol{\beta}^T \mathbf{x}_i\) has distribution \(F\); then \[
\gamma_{ij} = P(Y_i\leq j) = P(Z_i\leq \theta_j) = P(Z_i-\beta^Tx_i \leq \theta_j -\beta^Tx_i) = F(\theta_j-\beta^Tx_i)
\]
Now if, for example, \(F\) follows the logistic distribution, where \(F(x) = e^x/(1+e^x)\), then:
\[
\gamma_{ij} = \frac{\text{exp}(\theta_j-\beta^Tx_i)}{1+\text{exp}(\theta_j-\beta^Tx_i)}
\] and so we would have a logit model for the cumulative probabilities \(\gamma_{ij}\). Choosing the normal distribution for the latent variable leads to a probit model, while the choice of an extreme value distribution leads to the complementary log-log. This latent variable explanation for the model is displayed as the following figure.
theta <- c(-2, -1, 2)
tibble(t = seq(-6, 6, 0.1), pdf = dlogis(t, location = 0, scale = 1)) %>%
  ggplot() +
  geom_line(mapping = aes(x = t, y = pdf)) +
  annotate("segment", x = theta, y = 0, xend = theta, yend = dlogis(theta)) +
  annotate("text", x = theta, y = 0, label = c("theta[1]", "theta[2]", "theta[3]"), parse = TRUE) +
  labs(x = "t", y = "Density")
3.1 Proportional odds model
The logit link function dictates \[
\log \frac{\gamma_{ij}}{1 - \gamma_{ij}} = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \quad j = 1,\ldots,J-1.
\]
The corresponding inverse link function is \[
\gamma_{ij} = \mathbb{P}(Y_i \le j)= \frac{\exp (\theta_j - \mathbf{x}_i^T \boldsymbol{\beta})}{1 + \exp (\theta_j - \mathbf{x}_i^T \boldsymbol{\beta})}.
\]
When a coefficient \(\beta_k > 0\), as the corresponding predictor \(x_{ik}\) increases, every cumulative probability \(\gamma_{ij} = \mathbb{P}(Y_i \le j)\), \(j < J\), decreases by the same amount on the log-odds scale, so \(Y_i\) is more likely to take a larger value. This motivates the minus sign in the definition of the model, since it allows an easier interpretation of \(\boldsymbol{\beta}\).
It is called the proportional odds model because the relative odds for \(y \le j\) comparing \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are \[
\left( \frac{\gamma_j(\mathbf{x}_1)}{1 - \gamma_j(\mathbf{x}_1)} \right) / \left( \frac{\gamma_j(\mathbf{x}_2)}{1 - \gamma_j(\mathbf{x}_2)} \right) = \exp (- (\mathbf{x}_1 - \mathbf{x}_2)^T \boldsymbol{\beta}),
\] which does not depend on \(j\). Here \(\gamma_j(\mathbf{x}_i) = \gamma_{ij}\).
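We can check this property numerically with the fitted values reported below for pomod (intercepts 0.6449 and 1.7374, nincome coefficient 0.012739); holding the other predictors at zero is a simplifying assumption made only for this illustration:

```r
theta <- c(0.6449, 1.7374)  # fitted intercepts theta_1, theta_2
beta  <- 0.012739           # fitted nincome coefficient
gam   <- function(x) plogis(theta - x * beta)  # cumulative probabilities at income x

odds <- function(g) g / (1 - g)
odds(gam(10)) / odds(gam(60))  # the same relative odds for both j, and equal to:
exp(-(10 - 60) * beta)
```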
We can fit a proportional odds model using the polr function from the MASS package.
library(MASS)
pomod <- polr(sPID ~ age + educ + nincome, data = nes96)
summary(pomod)
Call:
polr(formula = sPID ~ age + educ + nincome, data = nes96)
Coefficients:
Value Std. Error t value
age 0.005775 0.003887 1.48581
educ.L 0.724087 0.384388 1.88374
educ.Q -0.781361 0.351173 -2.22500
educ.C 0.040168 0.291762 0.13767
educ^4 -0.019925 0.232429 -0.08573
educ^5 -0.079413 0.191533 -0.41462
educ^6 -0.061104 0.157747 -0.38735
nincome 0.012739 0.002140 5.95187
Intercepts:
Value Std. Error t value
Democrat|Independent 0.6449 0.2435 2.6479
Independent|Republican 1.7374 0.2493 6.9694
Residual Deviance: 1984.211
AIC: 2004.211
Stepwise regression leads to a model with nincome as the only predictor.
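A sketch of the reduced fit (the name pomodi is our choice, not from the source):

```r
pomodi <- polr(sPID ~ nincome, data = nes96)
c(full = AIC(pomod), reduced = AIC(pomodi))
```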
The probability of being a Democrat is given by the area lying to the left of the leftmost of each pair of lines, while the probability of being a Republican is given by the area to the right of the rightmost of the pair. Independents are represented by the area in-between.
3.2 Ordered probit model
If we use the probit link, then \[
\Phi^{-1}(\gamma_j(\mathbf{x}_i)) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \quad j=1,\ldots,J-1.
\]
opmod <- polr(sPID ~ nincome, method = "probit", data = nes96)
summary(opmod)
Call:
polr(formula = sPID ~ nincome, data = nes96, method = "probit")
Coefficients:
Value Std. Error t value
nincome 0.008182 0.001208 6.775
Intercepts:
Value Std. Error t value
Democrat|Independent 0.1284 0.0694 1.8510
Independent|Republican 0.7976 0.0722 11.0399
Residual Deviance: 1994.892
AIC: 2000.892
The deviance is similar to that of the logit link, but the coefficients appear different: the probit and logit links place the latent variable on different scales, so their coefficients are not directly comparable. The predictions are similar.
il <- c(8, 26, 42, 58, 74, 90, 107)
predict(opmod, data.frame(nincome = il), type = "probs")
3.3 Proportional hazards model
Suppose we use the cloglog link \[
\log (- \log (1 - \gamma_j(\mathbf{x}_i))) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}.
\]
The hazard of category \(j\) is the probability of falling in category \(j\) given that the response is at least \(j\): \[
\text{Hazard}(j) = \mathbb{P}(Y_i = j \mid Y_i \ge j) = \frac{\mathbb{P}(Y_i = j)}{\mathbb{P}(Y_i \ge j)} = \frac{\gamma_{ij} - \gamma_{i,j-1}}{1 - \gamma_{i,j-1}}.
\] The quantity \(- \log \mathbb{P}(Y > j)\) is called the cumulative hazard function.
Since \[\begin{eqnarray*}
1 - \gamma_j(\mathbf{x}) &=& \mathbb{P}(Y > j) = e^{- e^{\theta_j - \mathbf{x}^T \boldsymbol{\beta}}},
\end{eqnarray*}\] we have \[\begin{eqnarray*}
\log \mathbb{P}(Y > j) = - e^{\theta_j - \mathbf{x}^T \boldsymbol{\beta}}
\end{eqnarray*}\] and \[
\frac{- \log \mathbb{P}(Y > j \mid \mathbf{x}_1)}{- \log \mathbb{P}(Y > j \mid \mathbf{x}_2)} = e^{(\mathbf{x}_2 - \mathbf{x}_1)^T \boldsymbol{\beta}}
\] or \[
\mathbb{P}(Y > j \mid \mathbf{x}_1) = [\mathbb{P}(Y > j \mid \mathbf{x}_2)]^{\exp((\mathbf{x}_2 - \mathbf{x}_1)^T \boldsymbol{\beta})}.
\] It is called the proportional hazards model because the ratio of cumulative hazards does not depend on level \(j\).
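A small numeric check that the cumulative-hazard ratio is free of \(j\); the cutpoints theta and the coefficient beta here are made-up values:

```r
theta <- c(0.2, 0.9)                    # made-up cutpoints theta_1, theta_2
beta  <- 0.01                           # made-up coefficient
H <- function(x) exp(theta - x * beta)  # cumulative hazard -log P(Y > j | x)
H(0) / H(50)                            # the same ratio for every j: exp((50 - 0) * beta)
```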
polr(sPID ~ nincome, method = "cloglog", data = nes96)