Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a statistical technique used to estimate the parameters of a statistical model. This fundamental technique is widely used in fields such as economics, finance, biology, and machine learning. In simple terms, MLE finds the parameter values that make your observed data most probable. The main idea is to maximize the likelihood function, which is the probability of observing the data given a set of parameters.
Understanding the basic concepts
Before delving into the intricacies of maximum likelihood estimation, it is essential to understand some basic statistical and probability concepts.
Statistical models
A statistical model is a mathematical representation that describes random variables and their relationships. For example, suppose you have a set of data points. A statistical model might describe these data points as normally distributed, with a mean \(\mu\) and a standard deviation \(\sigma\).
Parameters
Parameters are aspects of the model that can be adjusted to fit the data. In the normal distribution example above, the parameters are the mean \(\mu\) and the standard deviation \(\sigma\).
Likelihood
The likelihood of a set of parameter values is defined as the probability of observing the data given those parameters. It is treated not as a probability over the parameters, but as a function of the parameters with the data held fixed. In more formal terms, if \(\theta\) represents the parameters of the distribution and \(X\) represents the observed data, then the likelihood is \(L(\theta \mid X)\).
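As a concrete illustration (a standard textbook case rather than a specific dataset): if the data are a recorded sequence of \(n\) independent coin flips containing \(k\) heads, and the parameter \(p\) is the probability of heads, then
\[ L(p \mid X) = p^{k} (1 - p)^{n - k}, \]
and different values of \(p\) assign different likelihoods to the same observed flips.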
The procedure for maximum likelihood estimation
Let us analyse the steps involved in maximum likelihood estimation:
Step 1: Choose a statistical model
The first step is to choose an appropriate statistical model based on the nature of your data. For example, if you are dealing with height measurement data, modeling it as a normal distribution may be appropriate.
Step 2: Define the likelihood function
The next step is to define the likelihood function for the model you choose. For example, assume your data is normally distributed; the likelihood function for the data sample \(X = x_1, x_2, \ldots, x_n\) and the parameters \(\mu\) (mean) and \(\sigma^2\) (variance) is:
\[ L(\mu, \sigma^2 \mid X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \]
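To make this concrete, here is a minimal sketch in Python (assuming NumPy and a small made-up sample) that evaluates this likelihood at two candidate parameter settings:

```python
import numpy as np

def normal_likelihood(mu, sigma2, x):
    """Likelihood of i.i.d. normal data x under mean mu and variance sigma2."""
    density = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(density)

x = np.array([1.2, 0.8, 1.5, 1.1, 0.9])   # made-up sample for illustration
print(normal_likelihood(1.0, 0.1, x))      # parameters close to the data: larger likelihood
print(normal_likelihood(3.0, 0.1, x))      # parameters far from the data: tiny likelihood
```

Because the likelihood is a product of many small density values, it underflows to zero in floating point for large samples, which is one practical reason the next step works with the log-likelihood instead.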
Step 3: Maximize the likelihood function
To find the parameter values that maximize this likelihood, we usually work with the log-likelihood function, because the logarithm turns the product into a sum that is easier to differentiate and does not change the location of the maximum:
\[ \ell(\mu, \sigma^2 \mid X) = \sum_{i=1}^{n} \left( -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \]
You then take the derivatives of the log-likelihood with respect to the parameters and set them to zero to solve for the parameter estimates.
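For the normal model the maximum has a closed form, derived in the next step, but in general the log-likelihood is maximized numerically. Here is a minimal sketch, assuming SciPy and the same kind of made-up sample, that minimizes the negative log-likelihood (equivalent to maximizing the log-likelihood):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.2, 0.8, 1.5, 1.1, 0.9])   # made-up sample for illustration

def negative_log_likelihood(params, x):
    mu, log_sigma2 = params                # optimize log(sigma^2) so the variance stays positive
    sigma2 = np.exp(log_sigma2)
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

result = minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(x,))
mu_hat, sigma2_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma2_hat)   # should match the closed-form estimates derived below
```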
Step 4: Solve for the parameters
For the normal distribution:
\[ \frac{\partial}{\partial \mu}\ell(\mu, \sigma^2 \mid X) = 0 \quad \Rightarrow \quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
\[ \frac{\partial}{\partial \sigma^2}\ell(\mu, \sigma^2 \mid X) = 0 \quad \Rightarrow \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2 \]
The solutions, \(\hat{\mu}\) and \(\hat{\sigma}^2\), are the maximum likelihood estimates of the mean and variance.
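In code, these closed-form estimates are just the sample mean and the average squared deviation; note the divisor \(n\) rather than \(n - 1\), so the MLE of the variance is biased downward in small samples. A quick sketch with the same made-up sample:

```python
import numpy as np

x = np.array([1.2, 0.8, 1.5, 1.1, 0.9])     # made-up sample for illustration

mu_hat = x.mean()                            # MLE of the mean: the sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()      # MLE of the variance: divides by n, not n - 1
print(mu_hat, sigma2_hat)
```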
Visual example
Let’s look at how the likelihood function behaves for a simple dataset and a model with a single parameter, \(p\), the probability of success in a binomial distribution. We’ll take data from a series of independent coin flips resulting in 4 heads out of 10 flips.
For this dataset the likelihood function is \(L(p \mid X) = \binom{10}{4} p^{4} (1 - p)^{6}\), and the parameter value that maximizes it is \(\hat{p} = 0.4\), i.e. a 40% probability of getting heads.
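A minimal sketch, assuming SciPy's binomial distribution, that evaluates this likelihood over a grid of candidate values of \(p\) and picks the maximizer:

```python
import numpy as np
from scipy.stats import binom

n_flips, n_heads = 10, 4
p_grid = np.linspace(0.001, 0.999, 999)           # candidate values of p
likelihood = binom.pmf(n_heads, n_flips, p_grid)   # L(p) = C(10, 4) p^4 (1 - p)^6
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)                                       # approximately 0.4
```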
Properties of maximum likelihood estimators
Maximum likelihood estimators have several notable properties that make them particularly useful in statistical inference:
Consistency
An estimator is consistent if it converges in probability to the true parameter value as the sample size increases. MLEs have this property under standard regularity conditions, meaning that they become more accurate as you collect more data.
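In symbols, writing \(\hat{\theta}_n\) for the MLE based on \(n\) observations and \(\theta_0\) for the true parameter value, consistency means
\[ \hat{\theta}_n \xrightarrow{p} \theta_0 \quad \text{as } n \to \infty. \]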
Efficiency
Asymptotically, the MLE attains the smallest variance possible for an unbiased estimator of the parameter. This lower limit on the variance is known as the Cramér-Rao lower bound.
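The bound is usually stated through the Fisher information \(I(\theta)\): for any unbiased estimator \(\hat{\theta}\) based on \(n\) i.i.d. observations,
\[ \operatorname{Var}(\hat{\theta}) \ge \frac{1}{n\, I(\theta)}, \qquad I(\theta) = \mathbb{E}\!\left[ \left( \frac{\partial}{\partial \theta} \log f(X \mid \theta) \right)^{2} \right]. \]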
Asymptotic normality
Under certain regularity conditions, the distribution of the MLE tends towards a normal distribution as the sample size increases. This is particularly useful for constructing confidence intervals.
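In symbols, under those regularity conditions,
\[ \sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\!\left(0,\; I(\theta_0)^{-1}\right), \]
which justifies approximate confidence intervals of the form \(\hat{\theta}_n \pm z_{\alpha/2} \big/ \sqrt{n\, I(\hat{\theta}_n)}\).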
Examples of MLE in different models
Example 1: Estimating the parameter of the exponential distribution
Consider an exponential distribution with parameter \(\lambda\). If you have a dataset \(X = x_1, x_2, \ldots, x_n\), then the likelihood function is given by:
\[ L(\lambda \mid X) = \prod_{i=1}^{n} \lambda \exp(-\lambda x_i) \]
Taking the logarithm:
\[ \ell(\lambda \mid X) = n \log(\lambda) - \lambda \sum_{i=1}^{n} x_i \]
Setting the derivative to zero gives:
\[ \frac{\partial}{\partial \lambda}\ell(\lambda \mid X) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \]
\[ \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} \]
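The estimate is simply the reciprocal of the sample mean. A quick numerical check in Python, sketched under the assumption that SciPy is used to draw the sample (the true rate of 2.0 is an arbitrary choice for illustration):

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(0)
true_lambda = 2.0
x = expon.rvs(scale=1 / true_lambda, size=1000, random_state=rng)  # scale = 1 / lambda

lambda_hat = len(x) / x.sum()   # MLE: n / sum(x_i), i.e. 1 / sample mean
print(lambda_hat)               # should be close to 2.0 for a sample this large
```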
Example 2: Estimating parameters in a linear regression model
In a simple linear regression model of the form \(y = \beta_0 + \beta_1 x + \epsilon\), where \(\epsilon \sim N(0, \sigma^2)\), the likelihood function is:
\[ L(\beta_0, \beta_1, \sigma^2 \mid y, x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right) \]
Maximizing this likelihood yields estimates for \(\beta_0\), \(\beta_1\), and \(\sigma^2\). Maximizing over the coefficients is equivalent to minimizing the sum of squared residuals, which leads to the normal equations; written in matrix form, the solution is:
\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
where \(X\) is the design matrix, \(y\) is the vector of responses, and \(\hat{\beta}\) is the vector of estimated coefficients. Under normally distributed errors, the maximum likelihood estimates of the coefficients coincide with the ordinary least squares estimates, and the MLE of the error variance is the average squared residual.
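A minimal sketch of this closed form in Python, assuming NumPy and a small made-up dataset (the design matrix gets a column of ones for the intercept):

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])      # design matrix: intercept column plus predictor
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) beta = X^T y without an explicit inverse
residuals = y - X @ beta_hat
sigma2_hat = (residuals ** 2).mean()           # MLE of the error variance: RSS / n
print(beta_hat, sigma2_hat)
```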
Advantages and disadvantages of MLE
Understanding the advantages and disadvantages of MLE can help you decide whether it is an appropriate method for parameter estimation.
Advantages
- Flexibility: MLE can be applied to many different distributions and scenarios. The fundamental concept of maximizing the likelihood of observed data aligns well with a variety of situations.
- Asymptotic properties: As discussed above, MLEs have some desirable asymptotic properties, such as consistency, efficiency, and normality, which make them statistically robust for large samples.
- Interpretability: The method produces a straightforward result: the estimated parameters are those that make the observed data 'most likely', given the assumptions of the model.
Disadvantages
- Complexity: For complex models, the likelihood function can be complicated, and maximizing it may require sophisticated numerical methods. This can be computationally intensive.
- Sensitivity to model assumptions: MLEs are highly dependent on the accuracy of the model. Misspecifying the model can lead to biased parameter estimates.
- Finite-sample limitations: In small samples, the MLE may not yet exhibit its asymptotic properties, such as efficiency and approximate normality, resulting in less reliable estimates.
Conclusion
Maximum likelihood estimation stands as a cornerstone technique in statistical inference, providing a structured and powerful approach to parameter estimation in a wide variety of statistical models. However, it requires careful consideration of the choice of model, as well as a readiness to tackle computational challenges in complex models. Despite its limitations, its flexible application and asymptotic properties secure its continued relevance and widespread use in both theoretical and applied statistics.