Eric Way's Personal Site

I Write $\sin(x)$ Not Tragedies

Two Estimation Methods and Linear Regression

2020-05-06 Mathematics

  1. Maximum Likelihood Estimation and Bayesian Estimation
     1.1. Linear Regression and Maximum Likelihood Estimation
     1.2. Problems with Linear Regression
  2. Ridge Regression, Lasso, and Bayesian Estimation
     2.1. Ridge Regression
     2.2. Lasso

Maximum Likelihood Estimation and Bayesian Estimation

We are often interested in estimating certain values based on available observations. For example, if we want to know the average height of students at a certain university, the most precise way is to obtain the height of every student. However, this is often not possible, and we can only randomly choose a proportion of the students to form a sample of the population and obtain their heights. We then estimate the average from the sample. We can view the height of every student as a random variable that follows a normal distribution with mean $\theta$, and assume the sample random variables are independently and identically distributed.

Given these observations, what is the most likely value of $\theta$? It is intuitive to argue that $\theta$ should be the average of all observations. Indeed, given $\theta$, we know the distribution of every random variable, and, since we assume they are independent, their joint distribution is simply the product of the individual distributions. Under this joint distribution, we can calculate the likelihood of observing the observed data:

$$ f_n(X\mid\theta) $$

Among all possible values of $\theta$, we choose the one that maximizes this likelihood. Under the assumptions we have made, a little algebra shows that we really should choose the sample mean as the estimate $\hat\theta$. It is worth noting that in this procedure, the value of $\hat\theta$ is determined by the sample data alone.

From a Bayesian viewpoint, $\theta$ itself is a random variable that follows some distribution, called a prior distribution. By giving a prior distribution, we claim, before any observation, that $\theta$ is likely to take certain values and unlikely to take others. How do we choose a prior distribution? By (statistical) experience; sometimes this amounts to common sense. For example, before we obtain any data, we may assume that the average height follows a normal distribution with mean $1.70$ and standard deviation $0.1$. After we obtain the sample data, we make use of Bayes' Theorem:

$$ \xi(\theta \mid X) \propto \xi(\theta) f_n(X\mid\theta) $$

that is, the posterior distribution is proportional to the prior distribution times the likelihood of the observed data. We then choose the $\theta$ that maximizes the posterior distribution as $\hat\theta$. Now the value of $\hat\theta$ is determined jointly by the sample data and the prior distribution. This is often advantageous: if we accidentally select the basketball team as our sample, the maximum likelihood approach will certainly yield an extremely high and unreliable estimate, while the Bayesian approach balances our common sense against the observations and yields a more reliable one. It is also clear that if, in the Bayesian approach, we simply take the prior to be the uniform distribution over the whole parameter space, this is the same as giving up the prior and taking the maximum likelihood approach.
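As a rough illustration of the difference, here is a minimal Python sketch of the height example. The sample, the known observation standard deviation, and the prior parameters are all made-up assumptions for illustration. It computes the maximum likelihood estimate (the sample mean) and the Bayesian maximum a posteriori estimate, which for a normal prior and normal likelihood is a precision-weighted average of the prior mean and the sample mean.

```python
import numpy as np

# Made-up data: suppose we accidentally sampled the basketball team.
rng = np.random.default_rng(0)
heights = rng.normal(1.92, 0.05, size=12)

sigma = 0.08          # assumed known observation standard deviation
mu0, tau = 1.70, 0.1  # assumed prior: theta ~ N(1.70, 0.1^2)

# Maximum likelihood estimate: just the sample mean.
theta_mle = heights.mean()

# Bayesian MAP estimate (= posterior mean for the normal-normal model):
# a precision-weighted average of the sample mean and the prior mean.
n = len(heights)
w_data = n / sigma**2
w_prior = 1 / tau**2
theta_map = (w_data * heights.mean() + w_prior * mu0) / (w_data + w_prior)

print(theta_mle)  # an extreme, sample-driven estimate
print(theta_map)  # pulled back toward the prior mean 1.70
```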

Linear Regression and Maximum Likelihood Estimation

We now discuss the relationship between linear regression and maximum likelihood estimation. Here we suppose there is only one predictor ($X$, the independent variable), but the discussion also holds for more predictors. In a linear regression model, we assume that

$$ Y = \beta_0 + \beta_1 X + \epsilon $$

where $\epsilon$ is the observation error, which we assume follows a normal distribution with mean $0$. Given $n$ observations, we want to estimate the values of $\beta_0, \beta_1$ as $\hat\beta_0, \hat\beta_1$. We want to maximize the likelihood function

$$ f_n(\epsilon \mid \beta_0, \beta_1) $$

that is, given the values of the parameters, we want the likelihood of observing these residuals to be maximized. The joint distribution of all observation residuals is proportional to the following:

$$ f_n(\epsilon \mid \beta_0, \beta_1) \propto \exp(-\frac{1}{2\sigma^2}\sum_{i=1}^n \epsilon_i^2) $$

To maximize this value, one only has to minimize

$$ \text{RSS} = \sum_{i=1}^n\epsilon_i^2 $$

where RSS stands for the residual sum of squares. Note that

$$ \epsilon_i = y_i - \hat\beta_0 - \hat\beta_1 x_i $$

Thus,

$$ \text{RSS} = \sum_{i=1}^n\epsilon_i^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 $$

Using calculus, one can easily solve for the values of $\hat\beta_0, \hat\beta_1$ that minimize this expression. The above derivation explains why we take minimizing RSS as the criterion for the best values of $\hat\beta_0, \hat\beta_1$, beyond the seemingly intuitive "correctness" of this criterion.
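A minimal Python sketch of this, on synthetic data with arbitrary true coefficients and noise level: the normal equations give the $\hat\beta_0, \hat\beta_1$ that minimize RSS, and numpy's own least-squares polynomial fit recovers the same line.

```python
import numpy as np

# Synthetic data for illustration: true beta_0 = 2.0, beta_1 = 0.5.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)

# Design matrix [1, x]; solving the normal equations minimizes RSS = ||y - X beta||^2.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)             # [intercept, slope], close to [2.0, 0.5]
print(np.polyfit(x, y, 1))  # same fit from numpy, reported highest degree first: [slope, intercept]
```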

Problems with Linear Regression

Since linear regression is based on maximum likelihood estimation, it inherits the disadvantages of maximum likelihood estimation. The main one is the high variance of the model: the fitted model can vary dramatically depending on the chosen sample, which often causes overfitting. The problem is especially pronounced with high-dimensional data, where the number of predictors equals or exceeds the sample size ($p \ge n$); there, linear regression is bound to overfit and may not even have a unique solution.
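A small sketch of this failure mode on arbitrary synthetic data with $p = 10 > n = 5$: the least-squares problem is rank-deficient, so the returned coefficients are only one of infinitely many coefficient vectors that fit the sample perfectly.

```python
import numpy as np

# 5 observations, 10 predictors: more unknowns than equations.
rng = np.random.default_rng(2)
n, p = 5, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)          # 5 < 10: the solution is not unique
print(X @ beta - y)  # residuals are (numerically) zero: a perfect, overfit "fit"
```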

Ridge Regression, Lasso, and Bayesian Estimation

To fix these problems of linear regression, one can, among other approaches, use Bayesian estimation to estimate the coefficients, which imposes a limit on how much the parameters can vary. Recall that in the maximum likelihood setting, we implicitly assume that every parameter value is equally plausible. In a Bayesian setting, however, we must have a prior distribution on the parameters. The two most popular priors are

  1. Normal distribution with mean 0, and
  2. Laplace distribution with mean 0.

We thereby imply that the probability of a coefficient taking a value around 0 is much higher than that of an extremely high or low value. Notice that, in a multi-variable setting, this only makes sense if we standardize the data beforehand so that different predictors are comparable. Also, the intercept $\beta_0$ does not need a prior distribution. We now work in a multi-predictor context, where our model is

$$ Y = \beta_0 + \sum_{j=1}^p \beta_j X_j + \epsilon $$
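A minimal sketch of the standardization step assumed above, using plain column-wise centering and scaling (other conventions exist):

```python
import numpy as np

def standardize(X):
    """Center each predictor and scale it to unit standard deviation,
    so that a single prior variance (equivalently, a single penalty weight)
    is a sensible assumption for every coefficient."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```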

Ridge Regression

We now suppose the parameters have a normal prior distribution with mean 0. We want to maximize the posterior distribution:

$$ \xi(\beta_0, \beta_1, \dots, \beta_p \mid \epsilon) \propto \xi(\beta_1, \beta_2, \dots, \beta_p) f_n(\epsilon \mid \beta_0, \beta_1, \dots, \beta_p) $$

The joint distribution of all observation residuals is proportional to the following:

$$ f_n(\epsilon \mid \beta_0, \beta_1, \dots, \beta_p) \propto \exp(-\frac{1}{2\sigma^2}\sum_{i=1}^n \epsilon_i^2) $$

Meanwhile, independent normal priors on the parameters mean that

$$ \xi(\beta_1, \beta_2, \dots, \beta_p) = \prod_{j=1}^p \xi(\beta_j) \propto \exp(-\frac{1}{2\sigma'^2}\sum_{j=1}^p \beta_j^2) $$

Thus,

$$ \xi(\beta_0, \beta_1, \dots, \beta_p \mid \epsilon) \propto \exp(-\frac{1}{2\sigma^2}\sum_{i=1}^n \epsilon_i^2 - \frac{1}{2\sigma'^2}\sum_{j=1}^p \beta_j^2) $$

To maximize this posterior, one only has to minimize

$$ \sum_{i=1}^n \epsilon_i^2 + \lambda \sum_{j=1}^p \beta_j^2 $$

where

$$ \epsilon_i = y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} $$

and $\lambda = \sigma^2 / \sigma'^2$ is a positive number, in practice determined by other methods (cross-validation or a validation set). This is how ridge regression is deduced from Bayesian estimation. Compared with ordinary linear regression, ridge regression additionally imposes a penalty on large coefficients.
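A minimal numerical sketch of this result on synthetic data, with an arbitrary $\lambda$ (in practice chosen by cross-validation): the closed-form ridge solution $(X^\top X + \lambda I)^{-1} X^\top y$ on centered data shrinks every coefficient toward 0 relative to ordinary least squares.

```python
import numpy as np

# Synthetic, illustrative data: some true coefficients are exactly zero.
rng = np.random.default_rng(3)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(0, 1, size=n)

# Center X and y so the intercept stays unpenalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 5.0  # arbitrary penalty weight for this sketch
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

print(beta_ols)    # unpenalized least squares
print(beta_ridge)  # every coefficient is shrunk toward 0, but none exactly 0
```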

Lasso

By similar reasoning, one can show that, assuming the coefficients follow a Laplace prior distribution, one is to minimize the following:

$$ \sum_{i=1}^n \epsilon_i^2 + \lambda \sum_{j=1}^p |\beta_j| $$

Lasso is said to be better at predictor selection than ridge regression, since it often produces coefficients that are exactly 0. This property is rooted in the p.d.f. of the Laplace distribution: at $x = 0$ there is a non-differentiable sharp corner where the density attains its maximum, and this corner becomes the non-differentiable absolute-value penalty that can push coefficients exactly to 0.
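A short comparison sketch using scikit-learn on synthetic data (the penalty strengths `alpha` are arbitrary choices here): the lasso fit sets several coefficients to exactly 0, while the ridge fit only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: most true coefficients are exactly zero.
rng = np.random.default_rng(4)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ beta_true + rng.normal(0, 1, size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print(lasso.coef_)  # several entries are exactly 0.0: built-in predictor selection
print(ridge.coef_)  # small but nonzero everywhere
```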
