Display system information and load tidyverse and faraway packages
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.5.2 fastmap_1.2.0 cli_3.6.5
[5] tools_4.5.2 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.50 jsonlite_2.0.0 xfun_0.56
[13] digest_0.6.37 rlang_1.1.7 evaluate_1.0.5
library(tidyverse)
library(faraway)
The faraway package contains the datasets used in the ELMR book.
1 Maximum Likelihood
Consider \(n\) independent observations \(Y_1, Y_2, \ldots, Y_n\) from a distribution with density \(f(y\mid \theta)\), where \(\theta\) is the (possibly vector-valued) parameter. Suppose we observe \(\mathbf{y} = (y_1,\ldots, y_n)^T\); then we define the likelihood as \[
L(\theta\mid y) = P(\mathbf{Y} = \mathbf{y}) = f(y_1,\ldots, y_n\mid \theta) = \prod_{i=1}^n f(y_i\mid \theta)
\]
The likelihood is a function of the parameter(s) given the data: it is the probability of the observed data for a specified value of the parameter(s).
Strictly speaking, for continuous data the likelihood is built from a density function and is not itself a probability. For continuous random variables \(Y_1, Y_2, \ldots, Y_n\) with probability density function \(f(y\mid \theta)\), the probability of observing a value in a small interval around \(y_i\) is \[
P(y_i \leq Y_i \leq y_i + dy) \approx f(y_i\mid \theta)\,dy
\]
The likelihood is the joint density of the data, \(f(y_1,\ldots, y_n\mid \theta)\), evaluated at the observed data, \(\mathbf{y}\). \[
L(\theta\mid y) \approx \prod_{i=1}^n f(y_i\mid \theta)
\]
For example, suppose that \(Y\) is binomially distributed \(B(n, p)\). The likelihood is \[
L(p\mid y) = P(Y = y) = {n \choose y} p^y(1-p)^{n-y}
\]
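As a quick sketch, we can evaluate the binomial likelihood on a grid of \(p\) values; the data (\(n = 10\) trials, \(y = 4\) successes) are made up for illustration.

```r
# Evaluate the binomial likelihood L(p | y) on a grid of p values.
# Illustrative data: n = 10 trials, y = 4 successes.
n <- 10
y <- 4
p_grid <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(y, size = n, prob = p_grid)  # choose(n, y) * p^y * (1 - p)^(n - y)
# The grid value maximizing the likelihood is y/n = 0.4
p_grid[which.max(lik)]
```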
The maximum likelihood estimate (MLE) is the value of the parameter(s) \(\theta\) that gives the largest probability to the observed data; in other words, the MLE of \(\theta\) is the value of \(\theta\) that maximizes the likelihood function. The MLE is denoted by \(\hat{\theta}\).
In most cases, it is easier to maximize the log of the likelihood function, \(l(\theta\mid y) = \log L(\theta \mid y)\). Since log is a monotone increasing function, the maximum occurs at the same \(\hat \theta\).
For example, for the binomial distribution, the log-likelihood is \[
l(p\mid y) = logL(p\mid y) = \log{n \choose y} + y\log(p) + (n-y)\log(1-p)
\]
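We can check numerically that the log-likelihood peaks at the same \(p\) as the likelihood itself, using the same illustrative data (\(n = 10\), \(y = 4\)):

```r
# The log-likelihood (dbinom with log = TRUE) is maximized at the same p
# as the likelihood, since log is monotone increasing.
n <- 10
y <- 4
p_grid <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(y, size = n, prob = p_grid)
loglik <- dbinom(y, size = n, prob = p_grid, log = TRUE)
identical(which.max(lik), which.max(loglik))  # TRUE
```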
2 Estimation
The score function is the derivative of the log likelihood with respect to the parameter(s). \[
u(p\mid y) = \frac{\partial l(p\mid y)}{\partial p} = \frac{y}{p} - \frac{n-y}{1-p}
\]
We can find the maximum likelihood estimate \(\hat p\) by solving \(u(p) = 0\), which gives \(\hat p = y/n\). We should also verify that this stationary point is actually a maximum, i.e., that the second derivative is negative.
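A sketch of this calculation, again with illustrative data (\(n = 10\), \(y = 4\)): we find the root of the score function and compare it with a direct numerical maximization of the log-likelihood.

```r
# Solve the score equation u(p) = y/p - (n - y)/(1 - p) = 0 numerically,
# and cross-check against a numerical maximizer of the log-likelihood.
# Illustrative data: n = 10, y = 4, so p-hat = y/n = 0.4.
n <- 10
y <- 4
score <- function(p) y / p - (n - y) / (1 - p)
p_hat <- uniroot(score, interval = c(1e-6, 1 - 1e-6))$root
loglik <- function(p) dbinom(y, size = n, prob = p, log = TRUE)
p_opt <- optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum
c(p_hat, p_opt, y / n)  # all close to 0.4
```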
Usually we want more than an estimate; some measure of the uncertainty in the estimate is valuable. This can be obtained via the Fisher information, which is: \[
I(\theta) = \mbox{var}\,\, u(\theta) = -\mbox{E}\left[\frac{\partial^2 l(\theta\mid y)}{\partial \theta^2}\right]
\]
One can show that the variance of \(\hat{\theta}\) can be estimated by: \[
\mbox{var}(\hat{\theta}) = \frac{1}{I(\theta)}
\] under mild regularity conditions. For the binomial example this gives: \[
\mbox{var}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n}
\]
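Continuing the illustrative binomial example (\(n = 10\), \(y = 4\)), the estimated variance and standard error of \(\hat p\) are:

```r
# Estimated variance of p-hat from the Fisher information:
# var(p-hat) = p-hat * (1 - p-hat) / n.
# Illustrative data: n = 10, y = 4.
n <- 10
y <- 4
p_hat <- y / n
var_hat <- p_hat * (1 - p_hat) / n
se_hat <- sqrt(var_hat)
c(var_hat = var_hat, se = se_hat)  # 0.024 and about 0.155
```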
Examples where the likelihood can be maximized explicitly are confined to simple cases. Typically, numerical optimization is necessary. The Newton–Raphson method is the most well-known technique. \[
\theta_{k+1} = \theta_k - H^{-1}(\theta_k) u(\theta_k)
\] where \(H(\theta)\) is the Hessian matrix of second derivatives of the log likelihood function evaluated at \(\theta\). \[
H(\theta) = \frac{\partial^2 l(\theta\mid y)}{\partial \theta \partial \theta^T}
\] We iterate this method, putting \(\theta_{k+1}\) in place of \(\theta_{k}\) and so on, until the procedure (hopefully) converges. The Fisher scoring method replaces \(H\) with \(-I\) and sometimes gives superior results. This method is used in fitting GLMs and is equivalent to iteratively reweighted least squares.
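For a scalar parameter the Newton–Raphson update reduces to \(p_{k+1} = p_k - u(p_k)/H(p_k)\), with \(H\) the second derivative of the log-likelihood. A minimal sketch for the binomial example (illustrative data \(n = 10\), \(y = 4\); the MLE \(y/n = 0.4\) is known in closed form, so this is purely to show the iteration):

```r
# Newton-Raphson iteration for the binomial MLE.
# Illustrative data: n = 10, y = 4.
n <- 10
y <- 4
u <- function(p) y / p - (n - y) / (1 - p)        # score (first derivative)
H <- function(p) -y / p^2 - (n - y) / (1 - p)^2   # second derivative of log-lik
p <- 0.2                                          # crude starting value
for (k in 1:20) p <- p - u(p) / H(p)              # update: p <- p - H^{-1} u
p  # converges to y/n = 0.4
```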