Likelihood Theory (ELMR Appendix)

Biostat 200C

Author

Dr. Jin Zhou @ UCLA

Published

April 6, 2026

Display system information and load tidyverse and faraway packages

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.50        jsonlite_2.0.0    xfun_0.56        
[13] digest_0.6.37     rlang_1.1.7       evaluate_1.0.5   
library(tidyverse)
library(faraway)

The faraway package contains the datasets used in the ELMR book.

1 Maximum Likelihood

  • Consider \(n\) independent observations \(Y_1, Y_2, \ldots, Y_n\) from a distribution with density \(f(y\mid \theta)\), where \(\theta\) is the (possibly vector-valued) parameter. Suppose we observe \(\mathbf{y} = (y_1,\ldots, y_n)^T\); then we define the likelihood as
    \[ L(\theta\mid y) = P(\mathbf{Y} = \mathbf{y}) = f(y_1,\ldots, y_n\mid \theta) = \prod_{i=1}^n f(y_i\mid \theta) \]

  • The likelihood is a function of the parameter(s) given the data and is the probability of the observed data given a specified value of the parameter(s).

  • For continuous data, the likelihood is a density function, not a probability. Suppose \(Y_1, Y_2, \ldots, Y_n\) are continuous random variables with probability density function \(f(y\mid \theta)\). Then, for a small interval of width \(dy\) around \(y_i\), \[ P(y_i \leq Y_i \leq y_i + dy) \approx f(y_i\mid \theta)\,dy \]

  • The likelihood is the joint density of the data, \(f(y_1,\ldots, y_n\mid \theta)\), evaluated at the observed data \(\mathbf{y}\). Dropping the \(dy\) factors, which do not depend on \(\theta\), gives \[ L(\theta\mid y) \approx \prod_{i=1}^n f(y_i\mid \theta) \]
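
As a quick numeric sketch (with made-up data and an assumed \(N(\theta, 1)\) model, not anything from the book), the product form is easy to evaluate in R:

# Hypothetical data: five observations assumed to come from N(theta, 1)
y <- c(4.2, 5.1, 3.8, 4.9, 4.5)
# Likelihood of theta: product of the individual normal densities
lik <- function(theta) prod(dnorm(y, mean = theta, sd = 1))
lik(4.5)  # likelihood near the sample mean
lik(0)    # a poorly supported value of theta gives a far smaller likelihood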

  • For example, suppose that \(Y\) is binomially distributed \(B(n, p)\). The likelihood is \[ L(p\mid y) = P(Y = y) = {n \choose y} p^y(1-p)^{n-y} \]
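
For instance, with hypothetical data of \(y = 7\) successes in \(n = 10\) trials, the likelihood can be traced over a grid of \(p\) values using dbinom:

# Hypothetical data: y = 7 successes out of n = 10 trials
n <- 10; y <- 7
p <- seq(0.01, 0.99, by = 0.01)
L <- dbinom(y, size = n, prob = p)  # choose(n, y) * p^y * (1 - p)^(n - y)
plot(p, L, type = "l", ylab = "L(p | y)")
abline(v = y / n, lty = 2)          # the likelihood peaks at p = y/n = 0.7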

  • The maximum likelihood estimate (MLE) is the value of the parameter(s) \(\theta\) that gives the largest probability to the observed data; in other words, the MLE of \(\theta\) is the value of \(\theta\) that maximizes the likelihood function. The MLE is denoted by \(\hat{\theta}\).

  • In most cases, it is easier to maximize the log of the likelihood function, \(l(\theta\mid y) = \log L(\theta \mid y)\). Since the log is a monotone increasing function, the maximum occurs at the same \(\hat \theta\).

  • For example, for the binomial distribution, the log-likelihood is \[ l(p\mid y) = \log L(p\mid y) = \log{n \choose y} + y\log(p) + (n-y)\log(1-p) \]
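
Continuing the same hypothetical example (y = 7, n = 10), the log-likelihood has its maximum at the same value of p as the likelihood itself:

# Log-likelihood for the same hypothetical y = 7, n = 10
n <- 10; y <- 7
p <- seq(0.01, 0.99, by = 0.01)
logL <- dbinom(y, size = n, prob = p, log = TRUE)
plot(p, logL, type = "l", ylab = "l(p | y)")
abline(v = y / n, lty = 2)  # same maximizer p-hat = y/n as the likelihood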

2 Estimation

  • The score function is the derivative of the log-likelihood with respect to the parameter(s). For the binomial example, \[ u(p\mid y) = \frac{\partial l(p\mid y)}{\partial p} = \frac{y}{p} - \frac{n-y}{1-p} \]
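
A small sketch of this score function for the hypothetical y = 7, n = 10 example:

# Score u(p) = y/p - (n - y)/(1 - p) for the hypothetical y = 7, n = 10
n <- 10; y <- 7
u <- function(p) y / p - (n - y) / (1 - p)
u(0.5)  # positive: the log-likelihood is still increasing at p = 0.5
u(0.7)  # zero at p = y/n, the stationary point
u(0.9)  # negative: we are past the maximum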

  • We can find the maximum likelihood estimate \(\hat p\) by solving \(u(p) = 0\), which gives \(\hat p = y/n\). We should also verify that this stationary point actually represents a maximum, i.e., that the second derivative is negative.
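
We can confirm both claims numerically, e.g., solving u(p) = 0 with uniroot and checking the sign of the second derivative at the root:

# Solve u(p) = 0 numerically and verify the stationary point is a maximum
n <- 10; y <- 7  # hypothetical data again
u <- function(p) y / p - (n - y) / (1 - p)
phat <- uniroot(u, interval = c(0.01, 0.99))$root
phat  # agrees with the closed form y/n = 0.7
# Second derivative of the log-likelihood is negative at phat:
-y / phat^2 - (n - y) / (1 - phat)^2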

  • Other properties of maximum likelihood estimators include consistency, sufficiency, asymptotic efficiency and asymptotic normality. These are discussed in books such as A Course in Large Sample Theory or Kalbfleisch (1985, Chapters 1 and 2).

  • Usually we want more than an estimate; some measure of the uncertainty in the estimate is valuable. This can be obtained via the Fisher information which is: \[ I(\theta) = \mbox{var}\,\, u(\theta) = -\mbox{E}\left[\frac{\partial^2 l(\theta\mid y)}{\partial \theta^2}\right] \]
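
For the binomial example, taking the expectation of the negative second derivative (using \(E(Y) = np\)) gives \(I(p) = n/\{p(1-p)\}\); a small sketch:

# Fisher information for B(n, p): I(p) = n / (p * (1 - p)),
# since -E[d^2 l / dp^2] = E(Y)/p^2 + (n - E(Y))/(1 - p)^2 = n/p + n/(1 - p)
n <- 10
I <- function(p) n / (p * (1 - p))
I(0.5)  # = 40: information is smallest at p = 0.5
I(0.9)  # larger near the boundaries, where the data are less variable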

  • One can show that, under mild regularity conditions, the variance of \(\hat{\theta}\) can be estimated by \[ \mbox{var}(\hat{\theta}) \approx \frac{1}{I(\hat{\theta})} \] For the binomial example this gives: \[ \mbox{var}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n} \]
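
Continuing the hypothetical y = 7, n = 10 example, we plug \(\hat p\) into \(1/I(p)\) to get a standard error and, as a sketch, an approximate 95% Wald interval:

# Estimated variance, standard error, and Wald interval (hypothetical data)
n <- 10; y <- 7
phat <- y / n
se <- sqrt(phat * (1 - phat) / n)    # sqrt(1 / I(phat))
se
phat + c(-1, 1) * qnorm(0.975) * se  # approximate 95% Wald interval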

  • Examples where the likelihood can be maximized explicitly are confined to simple cases. Typically, numerical optimization is necessary. The Newton–Raphson method is the best-known technique.
    \[ \theta_{k+1} = \theta_k - H^{-1}(\theta_k) u(\theta_k) \] where \(H(\theta)\) is the Hessian matrix of second derivatives of the log likelihood function evaluated at \(\theta\). \[ H(\theta) = \frac{\partial^2 l(\theta\mid y)}{\partial \theta \partial \theta^T} \] We iterate this method, putting \(\theta_{k+1}\) in place of \(\theta_{k}\) and so on, until the procedure (hopefully) converges. The Fisher scoring method replaces \(H\) with \(-I\) and sometimes gives superior results. This method is used in fitting GLMs and is equivalent to iteratively reweighted least squares.
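
A minimal sketch of both iterations for the scalar binomial example (hypothetical y = 7, n = 10), where H and I have simple closed forms:

# Newton-Raphson and Fisher scoring for the hypothetical binomial example
n <- 10; y <- 7
u <- function(p) y / p - (n - y) / (1 - p)       # score
H <- function(p) -y / p^2 - (n - y) / (1 - p)^2  # Hessian of the log-likelihood
p <- 0.5                                         # starting value
for (k in 1:5) p <- p - u(p) / H(p)              # theta_{k+1} = theta_k - H^{-1} u
p                                                # converges to y/n = 0.7
# Fisher scoring: replace H by -I, with I(p) = n / (p * (1 - p))
p <- 0.5
for (k in 1:5) p <- p + u(p) / (n / (p * (1 - p)))
p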