OLS (ordinary least squares)
Least squares: method of fitting the linear model $f(x) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j$, or equivalently $f(x) = x^T \beta$ (with the intercept absorbed into $\beta$ via a constant input), to a set of training data $(x_1, y_1), \dots, (x_N, y_N)$.
If $N > p$, then we can pick the coefficients $\beta$ to minimize the residual sum of squares (RSS)
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2.$$
In matrix notation:
$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$$
Differentiating w.r.t. $\beta$ we get the normal equations
$$X^T (y - X\beta) = 0$$
If $X^T X$ is non-singular, then the unique solution is given by
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
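As a quick check of the closed form, here is a minimal NumPy sketch (the data is synthetic, purely for illustration) that solves the normal equations directly and compares the result against `np.linalg.lstsq`, which solves the same problem via an SVD-based routine:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept column
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Normal equations: solve (X^T X) beta = X^T y
# (solve() avoids forming the explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based solver -- preferred when X^T X is ill-conditioned
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)    # close to beta_true
print(beta_lstsq)  # same solution
```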
Regularized LS estimate
The regularized least squares estimate minimizes the following objective:
$$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda R(\beta)$$
Where:
- $y$: observed values.
- $X$: input features.
- $\beta$: coefficients/parameters.
- $\lambda \ge 0$: regularization parameter (controls the trade-off between fit and penalty).
- $R(\beta)$: regularization term applied to the coefficients.
Ridge Regression (L2 Regularization):
- Regularization term: $R(\beta) = \|\beta\|_2^2 = \sum_{j} \beta_j^2$.
- Objective: $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
- Tends to shrink coefficients evenly, but does not force any coefficients to zero.
- Use when you suspect many small/medium-sized effects.
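Unlike the lasso, ridge regression retains a closed-form solution: the penalized normal equations give $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$. A minimal NumPy sketch (synthetic data; note the intercept is conventionally left unpenalized and is omitted here for brevity):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Penalized normal equations: (X^T X + lam * I) beta = X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.2]) + rng.normal(scale=0.1, size=50)

print(ridge_fit(X, y, lam=0.0))   # lam = 0 recovers OLS
print(ridge_fit(X, y, lam=10.0))  # coefficients shrink toward zero, none exactly zero
```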
Lasso Regression (L1 Regularization):
- Regularization term: $R(\beta) = \|\beta\|_1 = \sum_{j} |\beta_j|$.
- Objective: $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
- Can drive some coefficients exactly to zero, effectively performing feature selection.
- Use when you expect only a few variables to have significant effects.
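The lasso objective has no closed form (the $\ell_1$ term is non-differentiable at zero), but one standard solver is proximal gradient descent (ISTA), where a soft-thresholding step is exactly what drives small coefficients to zero. A minimal sketch, assuming the objective as written above; in practice a library solver such as scikit-learn's `Lasso` would be used instead:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrink each entry toward zero, clipping at 0
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    # ISTA for: min_beta ||y - X beta||_2^2 + lam * ||beta||_1
    L = 2.0 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)    # gradient of the squared-error term
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.2]) + rng.normal(scale=0.1, size=50)
print(lasso_ista(X, y, lam=5.0))  # some coefficients land exactly at zero
```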
WLS (weighted least squares)
A generalization of OLS, used when the error variances are not equal (heteroskedastic errors).
The difficulty is estimating the error variances $\sigma_i^2$, which are rarely known exactly, so the weights are based on the estimated variances.
Let the weights be
$$w_i = \frac{1}{\sigma_i^2}, \qquad W = \mathrm{diag}(w_1, \dots, w_N).$$
Then the estimator is given by
$$\hat{\beta}^{\text{WLS}} = (X^T W X)^{-1} X^T W y$$
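A minimal NumPy sketch of the weighted estimator, assuming the variances have already been estimated (here they are simply given, for illustration). With equal weights it reduces to OLS:

```python
import numpy as np

def wls_fit(X, y, sigma2):
    # Weighted normal equations: (X^T W X) beta = X^T W y, with W = diag(1/sigma_i^2)
    w = 1.0 / sigma2
    XtW = X.T * w                      # scales column i of X^T by w_i, i.e. X^T W
    return np.linalg.solve(XtW @ X, XtW @ y)

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
sigma2 = rng.uniform(0.1, 4.0, size=N)             # heteroskedastic error variances
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N) * np.sqrt(sigma2)

print(wls_fit(X, y, sigma2))                       # close to [1.0, 2.0]
print(np.linalg.solve(X.T @ X, X.T @ y))           # OLS: still unbiased, less efficient
```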