Chapter 4 Lasso

The least absolute shrinkage and selection operator, or Lasso, is similar to ridge regression, but its penalty is the sum of the absolute values of the coefficients: \[ \hat{\beta}^{Lasso} = \arg\min_\beta\Big\{\sum_{i = 1}^{N}\Big(y_i - \beta_0 - \sum_{j = 1}^{p}x_{ij}\beta_j\Big)^2 + \lambda\sum_{j = 1}^{p}\lvert\beta_j\rvert\Big\} \ \ \ (1) \] This is equivalent to the constrained form \[ \hat{\beta}^{Lasso} = \arg\min_\beta\Big\{\sum_{i = 1}^{N}\Big(y_i - \beta_0 - \sum_{j = 1}^{p}x_{ij}\beta_j\Big)^2\Big\}, \ \text{ subject to } \sum_{j = 1}^{p}\lvert\beta_j\rvert \leq t. \] Notice that in the Lasso the penalty term is the sum of the absolute values of the coefficients rather than the sum of their squares, as in ridge regression.
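
As a rough illustration of how the \(\ell_1\) penalty in (1) drives some coefficients exactly to zero, here is a minimal sketch using scikit-learn's Lasso on synthetic data; the data, the variable names, and the penalty weight are illustrative choices, not part of the chapter (scikit-learn calls the penalty weight `alpha` rather than \(\lambda\)).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first 3 of 10 predictors actually matter.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Lasso minimizes a squared-error term plus alpha * sum(|beta_j|), as in (1).
# (scikit-learn divides the squared-error term by 2n, so alpha is a rescaled
# lambda, but the qualitative shrinkage/selection behaviour is the same.)
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # coefficients of the irrelevant predictors are exactly 0
```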

4.0.1 Standardizing the dataset

Consider the lasso regression problem (find \(\beta\) that minimizes the following objective):

\[ \lVert X\beta - y \rVert^2 + \lambda \lVert \beta \rVert_1 \]

By standardizing the dataset, the columns of \(X\) have zero mean and unit norm, and \(y\) has zero mean. Centering the columns removes the need for an intercept. Moreover, since every column of \(X\) has unit norm, no column receives a smaller coefficient merely because its norm is larger, which would otherwise lead us to conclude incorrectly that the column does not explain \(y\) well. With all columns on the same scale, differences in the magnitudes of the elements of \(\beta\) are directly related to the “wiggliness” of the explanatory function \(X\beta\), which, loosely speaking, is what the regularization tries to control.
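
A minimal sketch of this standardization, assuming numpy and scikit-learn; the synthetic data, column scales, and penalty weight are illustrative only. Each column of \(X\) is centered and scaled to unit norm, \(y\) is centered, and the lasso is then fit without an intercept.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data whose columns live on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(200, 3))
y = 2.0 + X @ np.array([1.0, 0.2, -5.0]) + rng.normal(scale=0.3, size=200)

# Standardize: zero-mean, unit-norm columns of X; zero-mean y.
Xc = X - X.mean(axis=0)
Xs = Xc / np.linalg.norm(Xc, axis=0)
yc = y - y.mean()

# With centered data the intercept is unnecessary, and with unit-norm
# columns the l1 penalty treats every predictor on the same scale.
lasso = Lasso(alpha=0.01, fit_intercept=False).fit(Xs, yc)
print(lasso.coef_)  # coefficients are now comparable across predictors
```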