To submit the homework, upload your Rmd and HTML files to BruinLearn by the deadline.
An initial data analysis that explores the numerical and graphical characteristics of the data.
Variable selection to choose the best model.
An exploration of transformations to improve the fit of the model.
Diagnostics to check the assumptions of your model.
Some predictions of future observations for interesting values of the predictors.
An interpretation of the meaning of the model, written as a scientific abstract (<150 words).
BACKGROUND: brief intro of the study background, what are the existing findings
OBJECTIVE: state the overall purpose of your research, e.g., what kind of knowledge gap you are trying to fill in
METHODS: study design (how these data were collected), outcome definitions, statistical procedures used
RESULTS: summary of major findings to address the question raised in objective
CONCLUSIONS:
Write down the log-likelihood function of logistic regression for binomial responses.
Derive the gradient vector and Hessian matrix of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\).
Show that the log-likelihood function of logistic regression is a concave function in regression coefficients \(\boldsymbol{\beta}\). (Hint: show that the negative Hessian is a positive semidefinite matrix.)
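As a starting point for the three derivations above, the following sketch collects the main formulas. The notation is an assumption, not taken from the assignment: \(y_i\) successes out of \(m_i\) trials, covariate vector \(\mathbf{x}_i\), design matrix \(\mathbf{X}\) with rows \(\mathbf{x}_i^T\), success probability \(p_i = e^{\mathbf{x}_i^T \boldsymbol{\beta}} / (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}})\), and \(\mathbf{W} = \operatorname{diag}(m_i p_i (1 - p_i))\).

```latex
\begin{align*}
\ell(\boldsymbol{\beta})
  &= \sum_{i=1}^n \left[ y_i \mathbf{x}_i^T \boldsymbol{\beta}
     - m_i \log\left(1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}\right)
     + \log \binom{m_i}{y_i} \right], \\
\nabla \ell(\boldsymbol{\beta})
  &= \sum_{i=1}^n (y_i - m_i p_i)\, \mathbf{x}_i
   = \mathbf{X}^T (\mathbf{y} - \mathbf{m} \circ \mathbf{p}), \\
\nabla^2 \ell(\boldsymbol{\beta})
  &= -\sum_{i=1}^n m_i p_i (1 - p_i)\, \mathbf{x}_i \mathbf{x}_i^T
   = -\mathbf{X}^T \mathbf{W} \mathbf{X}, \\
\mathbf{v}^T \left(-\nabla^2 \ell(\boldsymbol{\beta})\right) \mathbf{v}
  &= \sum_{i=1}^n m_i p_i (1 - p_i) \left(\mathbf{x}_i^T \mathbf{v}\right)^2 \ge 0
  \quad \text{for all } \mathbf{v}.
\end{align*}
```

The last line holds because every weight \(m_i p_i (1 - p_i)\) is nonnegative, so the negative Hessian is positive semidefinite and \(\ell\) is concave in \(\boldsymbol{\beta}\).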
The National Institute of Diabetes and Digestive and Kidney Diseases
conducted a study on 768 adult female Pima Indians living near Phoenix.
The purpose of the study was to investigate factors related to diabetes.
The data may be found in the dataset pima.
Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot?
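One possible way to build the factor and the interleaved histogram is sketched below. The column name `test_f`, the level labels, and the choice of 30 bins are assumptions made for illustration; "interleaved" is read here as dodged bars via `position = "dodge"`.

```r
library(faraway)  # provides the pima data frame
library(ggplot2)

data(pima, package = "faraway")

# Factor version of the test result (coded 0/1 in the data);
# the labels are an assumption of this sketch
pima$test_f <- factor(pima$test, levels = c(0, 1),
                      labels = c("negative", "positive"))

# Interleaved (dodged) histogram of insulin by test result
p <- ggplot(pima, aes(x = insulin, fill = test_f)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(x = "Insulin", y = "Count", fill = "Test result")
print(p)

# Number of insulin values recorded as exactly zero
n_zero_insulin <- sum(pima$insulin == 0)
n_zero_insulin
```

Counting the exact zeros alongside the plot makes it easier to judge whether the spike at zero is plausible.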
Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution.
Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame?
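A minimal sketch of this step follows. Which variables have implausible zeros is a judgment call; the vector `zero_vars` and the names `pima_na` and `full_fit` are assumptions of this example, not part of the assignment.

```r
library(faraway)  # provides the pima data frame

data(pima, package = "faraway")

# Variables where a zero is treated as implausible (assumption of this sketch)
zero_vars <- c("glucose", "diastolic", "triceps", "insulin", "bmi")

# Recode zeros as NA in a copy of the data
pima_na <- pima
pima_na[zero_vars] <- lapply(pima_na[zero_vars],
                             function(x) replace(x, x == 0, NA))

# Logistic regression with all other variables as predictors;
# glm() silently drops any row with an NA in a model variable
full_fit <- glm(test ~ pregnant + glucose + diastolic + triceps +
                  insulin + bmi + diabetes + age,
                family = binomial, data = pima_na)

nobs(full_fit)                # rows actually used in the fit
sum(complete.cases(pima_na))  # rows with no missing value
nrow(pima_na)                 # rows in the data frame
```

Comparing `nobs()` with `nrow()` shows how many cases the default `na.action` removed.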
Refit the model but now without the insulin and triceps predictors. How many observations were used in fitting this model? Devise a test to compare this model with that in the previous question.
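A likelihood ratio test is only valid when both models are fitted to the same rows, so one option is to restrict both fits to the complete cases. The sketch below assumes the same zero-to-NA recoding as before; `zero_vars`, `pima_na`, `cc`, and the model names are hypothetical.

```r
library(faraway)  # provides the pima data frame

data(pima, package = "faraway")

# Recode implausible zeros as NA (variable choice is an assumption)
zero_vars <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima_na <- pima
pima_na[zero_vars] <- lapply(pima_na[zero_vars],
                             function(x) replace(x, x == 0, NA))

# Reduced model on all available rows: more cases than the full model uses
reduced_all <- glm(test ~ pregnant + glucose + diastolic + bmi +
                     diabetes + age,
                   family = binomial, data = pima_na)
nobs(reduced_all)

# For a fair comparison, fit both models on the complete cases only
cc <- na.omit(pima_na)
full_cc <- glm(test ~ pregnant + glucose + diastolic + triceps +
                 insulin + bmi + diabetes + age,
               family = binomial, data = cc)
reduced_cc <- glm(test ~ pregnant + glucose + diastolic + bmi +
                    diabetes + age,
                  family = binomial, data = cc)

# Likelihood ratio (deviance) test of the two nested models
lrt <- anova(reduced_cc, full_cc, test = "Chisq")
lrt
```

Calling `anova()` directly on models fitted to different numbers of rows would not give a meaningful test, which is why the complete-case subset is used here.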
Use AIC to select a model. You will need to take account of the missing values. Which predictors are selected? How many cases are used in your selected model?
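AIC values are comparable only across models fitted to the same observations, so a reasonable approach is to run the selection on the complete cases. This is a sketch under that assumption, using `step()` from base R; `cc`, `full_cc`, and `aic_fit` are hypothetical names.

```r
library(faraway)  # provides the pima data frame

data(pima, package = "faraway")

# Recode implausible zeros as NA (variable choice is an assumption)
zero_vars <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima_na <- pima
pima_na[zero_vars] <- lapply(pima_na[zero_vars],
                             function(x) replace(x, x == 0, NA))

# Keep only rows with no missing values so every candidate model
# is fitted to exactly the same cases
cc <- na.omit(pima_na)

full_cc <- glm(test ~ pregnant + glucose + diastolic + triceps +
                 insulin + bmi + diabetes + age,
               family = binomial, data = cc)

# Stepwise selection by AIC, starting from the full model
aic_fit <- step(full_cc, trace = 0)

formula(aic_fit)  # selected predictors
nobs(aic_fit)     # cases used in the selected model
```

Running `step()` on the data with NAs instead would change the number of rows as predictors are dropped, and `step()` stops with an error in that situation.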
Create a variable that indicates whether the case contains a missing value. Use this variable as a predictor of the test result. Is missingness associated with the test result? Refit the selected model, but now using as much of the data as reasonable. Explain why it is appropriate to do this.
library(faraway)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Recode zeros as NA in new "*2" columns, then flag rows with any missing value
pima <- pima %>%
  mutate(
    glucose2   = ifelse(glucose == 0, NA, glucose),
    diastolic2 = ifelse(diastolic == 0, NA, diastolic),
    triceps2   = ifelse(triceps == 0, NA, triceps),
    insulin2   = ifelse(insulin == 0, NA, insulin),
    bmi2       = ifelse(bmi == 0, NA, bmi),
    diabetes2  = ifelse(diabetes == 0, NA, diabetes),
    age2       = ifelse(age == 0, NA, age)) %>%
  # Count the NAs per row across the recoded columns
  mutate(has_missing = rowSums(is.na(select(., contains("2"))))) %>%
  # Collapse the count to a 0/1 indicator
  mutate(has_missing = ifelse(has_missing > 0, 1, 0))
missing.glm <- glm(test ~ has_missing, family = binomial(), data = pima)
library(gtsummary)
## #StandWithUkraine
missing.glm %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t = 0.05)
Characteristic | log(OR)¹ | 95% CI¹ | p-value |
---|---|---|---|
has_missing | 0.16 | -0.14, 0.45 | 0.3 |

¹ OR = Odds Ratio, CI = Confidence Interval
The regression above shows no significant association between missingness and the test result (log odds ratio 0.16, 95% CI -0.14 to 0.45, p = 0.3). This suggests that the distribution of the outcome among the complete cases is still representative of the original distribution, which supports the use of a “complete case” analysis.
Using the last fitted model of the previous question, what is the difference in the odds of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Give a confidence interval for this difference.
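On the odds scale the comparison is a ratio, \(\exp\{\hat\beta_{\text{bmi}} (Q_1 - Q_3)\}\), and a Wald interval follows by exponentiating the interval for the log odds ratio. The sketch below assumes the "last fitted model" is the AIC-selected model refitted on all rows that are complete for its predictors, and that `bmi` is among them; all object names are hypothetical.

```r
library(faraway)  # provides the pima data frame

data(pima, package = "faraway")

# Recode implausible zeros as NA (variable choice is an assumption)
zero_vars <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima_na <- pima
pima_na[zero_vars] <- lapply(pima_na[zero_vars],
                             function(x) replace(x, x == 0, NA))

# AIC selection on complete cases, then refit on as many rows as possible
cc <- na.omit(pima_na)
full_cc <- glm(test ~ pregnant + glucose + diastolic + triceps +
                 insulin + bmi + diabetes + age,
               family = binomial, data = cc)
aic_fit <- step(full_cc, trace = 0)
final_fit <- glm(formula(aic_fit), family = binomial, data = pima_na)

# The calculation below only applies if bmi was selected
if (!"bmi" %in% names(coef(final_fit))) {
  stop("bmi is not in the selected model; this comparison does not apply")
}

# First and third quartiles of BMI (ignoring missing values)
q <- quantile(pima_na$bmi, c(0.25, 0.75), na.rm = TRUE)
delta <- unname(q[1] - q[2])  # first quartile minus third quartile

b  <- unname(coef(final_fit)["bmi"])
se <- sqrt(vcov(final_fit)["bmi", "bmi"])

# Odds ratio (Q1 vs Q3) and a 95% Wald confidence interval
or <- exp(delta * b)
ci <- sort(exp(delta * (b + c(-1, 1) * qnorm(0.975) * se)))
or
ci
```

A profile likelihood interval from `confint()` on the `bmi` coefficient, rescaled by `delta` and exponentiated, would be an alternative to the Wald interval used here.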
Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory.
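The two questions can be examined side by side: the first is a marginal comparison of diastolic pressure between the groups, while the second asks about its effect after adjusting for the other predictors. The sketch below uses a Welch two-sample t-test for the first and the full logistic model for the second; both choices, and the object names, are assumptions of this example.

```r
library(faraway)  # provides the pima data frame

data(pima, package = "faraway")

# Recode implausible zeros as NA (variable choice is an assumption)
zero_vars <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima_na <- pima
pima_na[zero_vars] <- lapply(pima_na[zero_vars],
                             function(x) replace(x, x == 0, NA))

# Marginal question: does mean diastolic pressure differ by test result?
tt <- t.test(diastolic ~ test, data = pima_na)
tt

# Conditional question: is diastolic significant given the other predictors?
full_fit <- glm(test ~ pregnant + glucose + diastolic + triceps +
                  insulin + bmi + diabetes + age,
                family = binomial, data = pima_na)
diastolic_row <- coef(summary(full_fit))["diastolic", ]
diastolic_row
```

If diastolic pressure is correlated with predictors such as BMI or age, it can differ between the groups marginally while adding little once those predictors are in the model.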