Display system information and load tidyverse and faraway packages
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.5.2 fastmap_1.2.0 cli_3.6.5
[5] tools_4.5.2 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.50 jsonlite_2.0.0 xfun_0.56
[13] digest_0.6.37 rlang_1.1.7 evaluate_1.0.5
library(tidyverse)
library(faraway)
The faraway package contains the datasets used in the ELMR book.
1 Heart disease example
The dataframe wcgs in the faraway package contains data from the Western Collaborative Group Study.
wcgs %>% head(10)
age height weight sdp dbp chol behave cigs dibep chd typechd timechd
1 49 73 150 110 76 225 A2 25 A no none 1664
2 42 70 160 154 84 177 A2 20 A no none 3071
3 42 69 160 110 78 181 B3 0 B no none 3071
4 41 68 152 124 78 132 B4 20 B no none 3064
5 59 70 150 144 86 255 B3 20 B yes infdeath 1885
6 44 72 204 150 90 182 B4 0 B no none 3102
7 44 72 164 130 84 155 B4 0 B no none 3074
8 40 71 150 138 60 140 A2 0 A no none 3071
9 43 72 190 146 76 149 B3 25 B no none 3064
10 42 70 175 132 90 325 A2 0 A no none 1032
arcus
1 absent
2 present
3 absent
4 absent
5 present
6 absent
7 absent
8 absent
9 absent
10 present
We convert the dataframe into a tibble for compatibility with tidyverse.
wcgs <- wcgs %>% as_tibble() %>% print(width = Inf)
# A tibble: 3,154 × 13
age height weight sdp dbp chol behave cigs dibep chd typechd
<int> <int> <int> <int> <int> <int> <fct> <int> <fct> <fct> <fct>
1 49 73 150 110 76 225 A2 25 A no none
2 42 70 160 154 84 177 A2 20 A no none
3 42 69 160 110 78 181 B3 0 B no none
4 41 68 152 124 78 132 B4 20 B no none
5 59 70 150 144 86 255 B3 20 B yes infdeath
6 44 72 204 150 90 182 B4 0 B no none
7 44 72 164 130 84 155 B4 0 B no none
8 40 71 150 138 60 140 A2 0 A no none
9 43 72 190 146 76 149 B3 25 B no none
10 42 70 175 132 90 325 A2 0 A no none
timechd arcus
<int> <fct>
1 1664 absent
2 3071 present
3 3071 absent
4 3064 absent
5 1885 present
6 3102 absent
7 3074 absent
8 3071 absent
9 3064 absent
10 1032 present
# ℹ 3,144 more rows
For now, we focus on just three variables:
- chd, whether the person develops coronary heart disease or not,
- height, height of the person in inches,
- cigs, number of cigarettes smoked per day.
wcgs %>% select(chd, height, cigs) %>% summary()
chd height cigs
no :2897 Min. :60.00 Min. : 0.0
yes: 257 1st Qu.:68.00 1st Qu.: 0.0
Median :70.00 Median : 0.0
Mean :69.78 Mean :11.6
3rd Qu.:72.00 3rd Qu.:20.0
Max. :78.00 Max. :99.0
We use side-by-side boxplots to summarize the qualitative variable chd against the quantitative variables. First height
ggplot(data = wcgs) + geom_boxplot(mapping = aes(x = chd, y = height))
and then the number of cigarettes smoked per day
ggplot(data = wcgs) + geom_boxplot(mapping = aes(x = chd, y = cigs))
It seems that smoking more cigarettes is associated with heart disease, but height is not. How can we formally analyze this? If we use linear regression (a straight line) for the analysis, the line eventually extends beyond the [0, 1] range, making interpretation hard.
ggplot(data = wcgs) + geom_point(mapping = aes(x = cigs, y = chd))
2 Logistic regression
Bernoulli model for a binary response \[
Y_i = \begin{cases}
1 & \text{with probability } p_i \\
0 & \text{with probability } 1 - p_i
\end{cases}
\]
The parameter \(p_i = \mathbb{E}(Y_i)\) will be related to the predictors \(X_1, \ldots, X_{q}\) via an inverse link function\[
p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}},
\] where \(\eta_i\) is the linear predictor or systematic component\[
\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{q} x_{iq} = \mathbf{x}_i^T \boldsymbol{\beta}
\] with \[
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_q \end{pmatrix}, \quad \mathbf{x}_i = \begin{pmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{iq} \end{pmatrix}.
\]
The function \[
\eta = g(p) = \log \left( \frac{p}{1-p} \right)
\] that links \(\mathbb{E}(Y)\) to the systematic component is called the link function. This particular link function is also called the logit function.
The function \[
p = g^{-1}(\eta) = \frac{e^\eta}{1 + e^\eta}
\] is called the inverse link function. This particular function (inverse logit) is also called the logistic function. A graph of the logistic function:
ggplot(data = tibble(x = 0), mapping = aes(x = x)) + # null data
  stat_function(fun = ilogit) + # ilogit is from faraway
  xlim(-6, 6) +
  labs(x = expression(eta), y = "p", title = "Logistic function (inverse link)")
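As a quick numerical check (a Python sketch, separate from the R analysis), the logit link and the logistic function undo each other:

```python
import math

def logit(p):
    # link function: maps a probability p in (0, 1) to the real line
    return math.log(p / (1 - p))

def ilogit(eta):
    # inverse link (logistic function): maps a real eta back into (0, 1)
    return math.exp(eta) / (1 + math.exp(eta))

print(ilogit(0))                          # 0.5
print(math.isclose(logit(ilogit(2)), 2))  # True
```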
Exercise: Before we attempt to interpret the results from logistic regression, we first need to understand how the data is transformed to \((y_i, \mathbf{x}_i)\).
How to interpret the regression coefficients in logistic regression? Remember \[
\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \cdot \text{height} + \beta_2 \cdot \text{cigs}.
\] The quantity \[
o = \frac{p}{1-p}
\] is called odds (of an event).
Therefore \(\beta_1\) can be interpreted as follows: a unit increase in \(x_1\) with \(x_2\) held fixed increases the log-odds of success by \(\beta_1\), or equivalently, increases the odds of success by a factor of \(e^{\beta_1}\).
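This interpretation is easy to verify numerically: under the logit link the odds equal \(e^\eta\), so adding 1 to \(x_1\) multiplies the odds by exactly \(e^{\beta_1}\). A Python sketch with made-up coefficients:

```python
import math

# hypothetical coefficients and covariate values, for illustration only
b0, b1, b2 = -4.5, 0.025, 0.023
x1, x2 = 68, 20

def odds(x1, x2):
    # odds = p / (1 - p) = exp(eta) under the logit link
    return math.exp(b0 + b1 * x1 + b2 * x2)

ratio = odds(x1 + 1, x2) / odds(x1, x2)
print(math.isclose(ratio, math.exp(b1)))  # True
```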
The gtsummary package presents regression results in a much nicer way that facilitates interpretation. Summarize the log-odds:
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Exercise: Interpret the regression coefficients from wcgs fit.
# coef(lmod) is a named numeric vector, same as lmod$coefficients
(beta_hat <- unname(coef(lmod)))
[1] -4.50161397 0.02520779 0.02312740
exp(beta_hat)
[1] 0.01109108 1.02552819 1.02339691
How to interpret the effect of a pack a day (20 cigarettes) on heart disease?
exp(beta_hat[3] * 20)
[1] 1.588115
Suppose the probability of success in the presence of some condition is \(p_1\) and \(p_2\) in its absence. The relative risk or risk ratio is \(p_1 / p_2\). For example, the predicted probabilities of heart disease for a 68-inch-tall person who smokes a pack (20 cigarettes) a day and for one who does not smoke are, respectively
(p1 <- ilogit(sum(beta_hat * c(1, 68, 20))))
[1] 0.08907868
and
(p2 <- ilogit(sum(beta_hat * c(1, 68, 0))))
[1] 0.05800425
Then the relative risk is
p1 / p2
[1] 1.535727
When the probability of the event is very small (the rare disease assumption), i.e., \(p_1, p_2 \approx 0\), the odds ratio is approximately equal to the risk ratio \[
\frac{o_1}{o_2} = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} \approx \frac{p_1}{p_2}.
\]
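Using the beta_hat values printed above, a Python sketch reproduces both quantities; note that the odds ratio here equals \(e^{20 \hat\beta_2}\) exactly, while the risk ratio is close but not identical:

```python
import math

def ilogit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

# fitted coefficients from the R output above
beta_hat = [-4.50161397, 0.02520779, 0.02312740]

def pred_prob(height, cigs):
    eta = beta_hat[0] + beta_hat[1] * height + beta_hat[2] * cigs
    return ilogit(eta)

p1 = pred_prob(68, 20)  # 68in tall, pack-a-day smoker
p2 = pred_prob(68, 0)   # 68in tall, non-smoker
rr = p1 / p2                                    # risk ratio, about 1.5357
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))  # odds ratio, about 1.5881
print(round(rr, 6), round(odds_ratio, 6))
```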
5 Inference (analysis of deviance)
The deviance is defined as twice the difference in log-likelihood between the saturated model (the perfect-fit model) and the fitted model: \[
\begin{eqnarray*}
D &=& -2\log lik(\hat{\beta}) + 2 \log lik(\text{saturated model}).
\end{eqnarray*}
\] It comes from the likelihood ratio test (LRT) statistic \[
2 \log \frac{L_{\Omega}}{L_{\omega}},
\] where \(\Omega\) is the full/saturated model (same number of parameters as observations) and \(\omega\) is the smaller model.
The usual goodness-of-fit test using the \(\chi_{n-q-1}^2\) asymptotic null distribution cannot be applied here, since we only have a single observation for each predictor pattern. This is different from the binomial model in the next chapter. The Hosmer-Lemeshow test partitions the predicted probabilities into \(J\) bins and then carries out a Pearson \(X^2\)-type test to assess the goodness of fit (ELMR 2.6).
In the model output, the residual deviance, denoted \(D_L\), is the deviance of the current model, and the null deviance, denoted \(D_S\), is the deviance of the model with just an intercept term. Assuming the null model, the test statistic \(D_S - D_L\) is asymptotically \(\chi_{\ell - s}^2\), where \(\ell\) and \(s\) are the numbers of parameters in the larger and smaller models. In our case, the test statistic is
We do not expect the raw residuals to have equal variance because the binary variance is \(p(1 - p)\).
The deviance residuals are standardized residuals defined by \[
d_i = \text{sign}(y_i - \widehat{p}_i) \sqrt{-2 [y_i \log\widehat{p}_i + (1 - y_i) \log(1 - \widehat{p}_i)]}.
\] Note \[
\sum_i d_i^2 = \text{deviance}
\] in analogy to \(\sum_i \widehat{\epsilon}_i^2 = \text{RSS}\) in linear regression. The term \(\text{sign}(y_i - \widehat{p}_i)\) ensures that \(d_i\) has the same sign as raw residual \(y_i - \widehat{p}_i\).
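The identity \(\sum_i d_i^2 = \text{deviance}\) holds by construction; a Python sketch with made-up fitted probabilities confirms it:

```python
import math

def dev_resid(y, p):
    # deviance residual for a single binary observation
    loglik = y * math.log(p) + (1 - y) * math.log(1 - p)
    sign = 1 if y - p >= 0 else -1
    return sign * math.sqrt(-2 * loglik)

# made-up responses and fitted probabilities
y = [1, 0, 1, 0, 0]
p = [0.8, 0.3, 0.4, 0.1, 0.6]

deviance = -2 * sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                    for yi, pi in zip(y, p))
d = [dev_resid(yi, pi) for yi, pi in zip(y, p)]
print(math.isclose(sum(di ** 2 for di in d), deviance))  # True
```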
Again we see the residuals are clustered into two lines: the upper one corresponding to \(y_i = 1\) and the lower one to \(y_i = 0\). We can improve this plot by binning: divide the range of the linear predictor into 100 bins containing roughly equal numbers of points, and plot the average residual against the average linear predictor per bin.
What this plot tells us is there is no obvious association between the deviance residuals and the linear predictor. If there is a pattern, it suggests that the model is not correctly specified. We may need to add more predictors or transform the existing predictors.
Question: is there a concern that the deviance residuals are not centered around 0?
Exercise: Do similar binned plots of deviance residuals against height and against cigs to check the linearity assumption.
Previous year’s bonus exercise: Simulate data for a logistic regression model with a quadratic term (e.g., \(X_1^2\)) as the true model and check the linearity assumption using the following plots:
Binned deviance residuals against linear predictor \(X_1\) when you model the systematic component as a linear function of the predictors
Binned deviance residuals against the quadratic term \(X_1^2\) when you model the systematic component as a quadratic function of the predictors
Binned deviance residuals against fitted value (\(\hat\eta\)) when you model the systematic component as a quadratic function of the predictors
Binned deviance residuals against fitted value (\(\hat\eta\)) when you model the systematic component as a linear function of the predictors
Scatter plot of logit(binned \(Y\)) and \(X_1^2\): break the range of \(X_1\) into bins, and within each bin, calculate the mean value of \(X_1\) and \(Y\) for observations in that bin. We then transform the mean of \(Y\) through the link function
Scatter plot of logit(binned \(Y\)) and \(\hat\eta\)
Scatter plot of logit(binned \(Y\)) and \(X_1\)
A QQ plot is not helpful here, since there is no reason these deviance residuals should be approximately standard normal.
qqnorm(devres)
Half-normal plot (hat values against half-normal quantiles) can help detect unusual cases in predictor space (high leverage cases). For logistic regression, we use the generalized hat matrix \[
\mathbf{H} = \widehat{\mathbf{W}}^{1/2} \mathbf{X} (\mathbf{X}^T \widehat{\mathbf{W}} \mathbf{X})^{-1} \mathbf{X}^T \widehat{\mathbf{W}}^{1/2},
\] where \[
\widehat{\mathbf{W}} = \begin{pmatrix}
\widehat{p}_1 (1 - \widehat{p}_1) & \\
& \ddots & \\
& & \widehat{p}_n (1 - \widehat{p}_n)
\end{pmatrix}.
\]
halfnorm(hatvalues(lmod))
We see two high-leverage cases, corresponding to men who smoke an unusually large number of cigarettes per day!
Intuitively, if we divide the observations into \(J\) bins according to the linear predictor \(\eta\), then \(y_j / m_j\) (the observed proportion of “successes”) in the \(j\)-th bin should be close to the average predicted probability in that bin.
wcgs_binned %>%
  ggplot(mapping = aes(x = avgpred, y = y / count)) +
  geom_point() +
  geom_linerange(mapping = aes(ymin = y / count - 2 * se_fit,
                               ymax = y / count + 2 * se_fit), alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "Predicted probability", y = "Observed proportion")
The Hosmer-Lemeshow test formalizes this idea by testing the statistic \[
X_{\text{HL}}^2 = \sum_{j=1}^J \frac{(y_j - m_j \widehat{p}_j)^2}{m_j \widehat{p}_j (1 - \widehat{p}_j)}
\] against the \(\chi^2\) distribution with \(J-1\) degrees of freedom.
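A Python sketch of the Hosmer-Lemeshow statistic on hypothetical binned data (all counts and probabilities below are made up for illustration):

```python
# y_j observed successes out of m_j cases in bin j, with average
# predicted probability p_j per bin (hypothetical numbers)
y = [2, 5, 9, 14, 22]
m = [100, 100, 100, 100, 100]
p = [0.03, 0.06, 0.10, 0.15, 0.20]

x2_hl = sum((yj - mj * pj) ** 2 / (mj * pj * (1 - pj))
            for yj, mj, pj in zip(y, m, p))
print(round(x2_hl, 3))  # small values suggest no lack of fit
```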
We see a moderate p-value, which indicates no lack of fit.
8.2 ROC curve
Logistic regression is often used as a tool for classification.
If we choose a threshold, say 0.2, then the predicted probabilities give a classification rule \[
\text{case $i$ is a} \begin{cases}
\text{"success"} & \text{if } \widehat{p}_i \ge 0.2 \\
\text{"failure"} & \text{if } \widehat{p}_i < 0.2
\end{cases}.
\]
wcgs %>%
  mutate(predprob = predict(lmod, type = "response")) %>%
  mutate(predout = ifelse(predprob >= 0.2, "yes", "no")) %>%
  xtabs(~ chd + predout, data = .)
predout
chd no yes
no 2886 11
yes 254 3
With this classification rule, we see the error rate is about
(11+254) / (2886+254+11+3)
[1] 0.08402029
The sensitivity is \[
\frac{\text{TP}}{\text{TP + FN}} = \frac{3}{257} = 1.17\%
\] and the specificity is \[
\frac{\text{TN}}{\text{FP + TN}} = \frac{2886}{11 + 2886} = 99.62\%
\]
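These rates can be recomputed directly from the 2×2 table above (a Python sketch):

```python
# counts from the cross-tabulation at threshold 0.2
tn, fp = 2886, 11  # chd == "no"  row
fn, tp = 254, 3    # chd == "yes" row

n = tn + fp + fn + tp
error_rate  = (fp + fn) / n   # misclassified cases
sensitivity = tp / (tp + fn)  # TP / (TP + FN)
specificity = tn / (tn + fp)  # TN / (FP + TN)
print(round(error_rate, 4), round(sensitivity, 4), round(specificity, 4))
# 0.084 0.0117 0.9962
```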
If we lower the threshold, then we increase the sensitivity but decrease the specificity. If we plot sensitivity against 1-specificity by varying the threshold, then we get the receiver operating characteristic (ROC) curve.
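A minimal sketch of tracing an ROC curve by hand on toy data (trapezoidal AUC; this is not the pROC computation used for lmod_roc):

```python
# toy predicted probabilities and true labels (made up)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]
P = sum(labels)      # number of positives
N = len(labels) - P  # number of negatives

# sweep thresholds; each gives a point (1 - specificity, sensitivity)
points = []
for t in sorted(set(scores) | {1.1}, reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    points.append((fp / N, tp / P))

points.sort()
# area under the ROC curve by the trapezoid rule
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))  # 0.8
```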
The area under the curve (AUC) is a measure of the overall classification performance of the classifier. A larger AUC means better classification performance.
auc(lmod_roc)
Area under the curve: 0.6007
9 Model selection by AIC
We can perform sequential search using the Akaike information criterion (AIC) \[
\text{AIC} = \text{deviance} + 2q.
\] We start from a rich enough model
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
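The AIC trade-off can be sketched numerically (hypothetical deviance and parameter counts, for illustration only):

```python
# AIC = deviance + 2 * q for a few hypothetical candidate models
models = {
    "full":      (1569.3, 11),  # (deviance, number of parameters q)
    "reduced":   (1571.0, 7),
    "intercept": (1781.2, 1),
}

aic = {name: dev + 2 * q for name, (dev, q) in models.items()}
best = min(aic, key=aic.get)
# the reduced model wins: its deviance is only slightly larger,
# but it pays a much smaller complexity penalty
print(best)  # reduced
```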
10 Model selection by lasso
A modern approach for model selection is the lasso, which minimizes the function \[
- n^{-1} \ell(\boldsymbol{\beta}) + \lambda \sum_{j=1}^q |\beta_j|,
\]
where \(\ell(\boldsymbol{\beta})\) is the log-likelihood of logistic regression and \(\lambda>0\) is a tuning parameter. We notice that
when \(\lambda = \infty\), all non-intercept regression coefficients will be pushed to 0, and
when \(\lambda = 0\), the regression coefficients are the same as those from regular logistic regression.
If we vary \(\lambda\) from 0 to larger values, we obtain intermediate models with fewer and fewer predictors. In this way we achieve continuous model selection.
How do we choose \(\lambda\), which determines the model size? One natural idea is to split the data into a training set and a validation set. The training set is used to fit the logistic regression at different \(\lambda\) values. Then the validation set is used to evaluate and compare the performance of different models. We will choose the model that gives the best performance on the validation set.
# A tibble: 3,140 × 11
age height weight sdp dbp chol cigs dibep chd arcus bmi
<int> <int> <int> <int> <int> <int> <int> <fct> <fct> <fct> <dbl>
1 49 73 150 110 76 225 25 A no absent 1445.
2 42 70 160 154 84 177 20 A no present 1607.
3 42 69 160 110 78 181 0 B no absent 1630.
4 41 68 152 124 78 132 20 B no absent 1571.
5 59 70 150 144 86 255 20 B yes present 1506.
6 44 72 204 150 90 182 0 B no absent 1992.
7 44 72 164 130 84 155 0 B no absent 1601.
8 40 71 150 138 60 140 0 A no absent 1485.
9 43 72 190 146 76 149 25 B no absent 1855.
10 42 70 175 132 90 325 0 A no present 1758.
# ℹ 3,130 more rows
Split the data into 80% training cases and 20% validation cases.
library(glmnet)
library(caret)
library(themis) # provides up/down-sampling methods for the data

# set seed for reproducibility
set.seed(200)

# list = FALSE: request the result as a matrix (of row positions), not a list
training_samples <- wcgs$chd %>%
  createDataPartition(p = 0.8, list = FALSE)
# (train_data <- wcgs %>%
#   slice(training_samples[, 1]))
# (val_data <- wcgs %>%
#   slice(-training_samples[, 1]))
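Conceptually, the split is just a seeded shuffle of row indices (a Python sketch; note that createDataPartition above additionally stratifies on the response, which this sketch does not):

```python
import random

n = 3140            # number of complete cases in wcgs
random.seed(200)    # for reproducibility

idx = list(range(n))
random.shuffle(idx)
cut = int(0.8 * n)  # 80% training, 20% validation
train_idx, val_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(val_idx))  # 2512 628
```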
The glmnet package takes a matrix (of predictors) and a vector (of responses) as input. We use the model.matrix function to create them. glmnet adds an intercept by default, so we drop the intercept term when forming the x matrix.
# X and y from original data
x_all <- model.matrix(
  chd ~ -1 + age + height + weight + bmi + sdp + dbp + chol + dibep + cigs + arcus,
  data = wcgs)
y_all <- ifelse(wcgs$chd == "yes", 1, 0)
# training X and y
x_train <- x_all[training_samples[, 1], ]
y_train <- y_all[training_samples[, 1]]
# validation X and y
x_val <- x_all[-training_samples[, 1], ]
y_val <- y_all[-training_samples[, 1]]
Fit lasso regression and plot solution path:
lasso_fit <- glmnet(x_train, y_train, family = "binomial", alpha = 1)
summary(lasso_fit)
\(\alpha = 1\) corresponds to the lasso regression, in the family of elastic net penalties\[
(1 - \alpha) \|\boldsymbol{\beta}\|_2^2 / 2 + \alpha \|\boldsymbol{\beta}\|_1.
\]
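A small Python check of the elastic-net penalty's two extremes (lasso at \(\alpha = 1\), ridge at \(\alpha = 0\)):

```python
def enet_penalty(beta, alpha):
    # (1 - alpha) * ||beta||_2^2 / 2 + alpha * ||beta||_1
    l2_sq = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    return (1 - alpha) * l2_sq / 2 + alpha * l1

beta = [0.5, -2.0, 0.0, 1.5]
print(enet_penalty(beta, 1.0))  # pure lasso: 0.5 + 2.0 + 0.0 + 1.5 = 4.0
print(enet_penalty(beta, 0.0))  # pure ridge: (0.25 + 4.0 + 0.0 + 2.25) / 2 = 3.25
```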
Other choices for xvar are "lambda" for the log lambda value, "norm" for the \(\ell_1\)-norm of the coefficients (default), and "dev" for the percentage of deviance explained.
Now we can evaluate the performance of the models (corresponding to different \(\lambda\) values) on the validation set.
# predict validation-case probabilities at different lambda values
# and calculate the validation deviance
pred_val <- predict(lasso_fit, newx = x_val, type = "response", s = lasso_fit$lambda)
dev_val <- -2 * colSums(y_val * log(pred_val) + (1 - y_val) * log(1 - pred_val))
tibble(lambda = lasso_fit$lambda, dev_val = dev_val) %>%
  ggplot() +
  geom_point(mapping = aes(x = lambda, y = dev_val)) +
  scale_x_log10() +
  labs(y = "Binomial deviance on validation set", x = "Lambda")
From the graph, it seems that \(\lambda = 0.005\) yields a model that performs best on the validation set.
Now, should we base our decision on this single training-validation split? A common strategy is to use cross-validation instead.
set.seed(200)
cv_lasso <- cv.glmnet(x_all, y_all, alpha = 1, family = "binomial", type.measure = "auc")
plot(cv_lasso)
The plot displays the cross-validation error (deviance by default; we chose AUC here) as a function of log(lambda). The left dashed vertical line indicates that the optimal value of log(lambda) is approximately -5.2; this is the value that maximizes the cross-validated AUC. This lambda value gives the most accurate model. The exact value of lambda and the corresponding model can be viewed as follows:
cv_lasso$lambda.min
[1] 0.005696827
coef(cv_lasso, cv_lasso$lambda.min)
12 x 1 sparse Matrix of class "dgCMatrix"
s=0.005696827
(Intercept) -1.040951e+01
age 4.954011e-02
height .
weight 5.334625e-03
bmi .
sdp 1.537576e-02
dbp .
chol 9.228292e-03
dibepA 5.162308e-01
dibepB -5.494421e-15
cigs 1.656021e-02
arcuspresent 9.149817e-02
We see that this model differs from the best model chosen by AIC only in that it replaces the predictor bmi with weight.