Regularization
Section outline
-
Linear regression can be made non-linear by adding columns with products of variables. This increases its capacity to fit the data, but at the same time its capacity to overfit. We have used the widget Polynomial regression from Orange's add-on Educational, where the curve through the points we draw in Paint passes through all the data at the cost of swinging wildly up and down; we also observed the coefficients in a data table and saw that they become huge.
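To see the same effect outside Orange, here is a minimal sketch in plain Python (numpy is an assumption of this sketch, not of the course): a high-degree polynomial fitted to a handful of noisy points interpolates them almost exactly, and its coefficients explode.

```python
# A minimal sketch (not the Orange widget): fitting polynomials of growing
# degree to 10 noisy points. The synthetic data is an illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for degree in (1, 3, 9):
    coefs = np.polyfit(x, y, degree)          # ordinary least-squares fit
    fitted = np.polyval(coefs, x)
    print(f"degree {degree}: max |coef| = {np.abs(coefs).max():.1f}, "
          f"training MSE = {np.mean((y - fitted) ** 2):.4f}")
```

The degree-9 fit has (nearly) zero training error, yet its coefficients are orders of magnitude larger than those of the simpler fits.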
To counter this, we change the function that is optimized by the regression. Regression fits a model $\hat{y} = \sum_i\beta_i x_i$ by optimizing a loss function such as the mean squared error, $\sum_j (y_j - \hat{y}_j)^2$. To this, we add a punishment for large coefficients, so we optimize $\sum_j (y_j - \hat{y}_j)^2 + \alpha\sum_i \beta_i^2$. The first term punishes bad fits, the second punishes large coefficients. The factor $\alpha$ balances between the two parts: a larger $\alpha$ forces the model to fit the data less, that is, to keep all coefficients small.
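For this squared penalty, the optimal coefficients can even be written in closed form, $\beta = (X^TX + \alpha I)^{-1}X^Ty$. A small numpy sketch (the synthetic data and the values of $\alpha$ are illustrative choices, and the data is centred so we can ignore the intercept) shows how growing $\alpha$ shrinks the coefficients:

```python
# A sketch of ridge regression via its closed-form solution,
# beta = (X^T X + alpha * I)^(-1) X^T y   (data is centred, so no intercept).
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 8
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=n)
X, y = X - X.mean(axis=0), y - y.mean()       # centre to drop the intercept

for alpha in (0.0, 1.0, 10.0, 100.0):
    beta = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
    print(f"alpha = {alpha:6.1f}: ||beta||^2 = {np.sum(beta ** 2):.3f}")
```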
Alternatively, we can take absolute values of coefficients instead of squaring them, $\sum_j (y_j - \hat{y}_j)^2 + \alpha\sum_i |\beta_i|$. This encourages the model to set coefficients to exactly zero instead of merely keeping them small (see a YouTube video with an explanation of why this happens). This kind of regularization is thus useful for selecting subsets of features.
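If scikit-learn is available (an assumption of this sketch; the lecture itself uses Orange), the zeroing effect of the absolute-value penalty is easy to observe:

```python
# A sketch with scikit-learn's Lasso: as alpha grows, more and more
# coefficients become exactly zero. Data and alpha values are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  # only 2 informative features

for alpha in (0.01, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(model.coef_ == 0)
    print(f"alpha = {alpha}: {n_zero} of {model.coef_.size} coefficients are exactly zero")
```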
The first type of regularization is called ridge regression or L2, and the second is LASSO or L1.
Similar regularization can be used in classification, in particular in logistic regression.
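In scikit-learn, for instance, logistic regression is regularized through the parameter C, which is the inverse of the regularization strength (small C means strong regularization), and the penalty can be set to L1 or L2. A sketch on a built-in data set (the parameter values are illustrative choices):

```python
# A sketch of L1- vs L2-regularized logistic regression in scikit-learn
# (assuming scikit-learn is installed). Note that C is the *inverse* of the
# regularization strength: small C means strong regularization.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

for penalty, solver in (("l2", "lbfgs"), ("l1", "liblinear")):
    for C in (0.01, 1.0):
        model = LogisticRegression(penalty=penalty, C=C, solver=solver,
                                   max_iter=1000).fit(X, y)
        coef = model.coef_.ravel()
        print(f"{penalty}, C={C:5.2f}: max |coef| = {np.abs(coef).max():.2f}, "
              f"zero coefficients: {np.sum(coef == 0)}")
```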
We should take this as a concrete example of a general problem. In our first lecture, we said it is impossible to decide what the right number of bins in a distribution is, or the appropriate amount of smoothing. In modelling, we balance between overfitting and overgeneralizing (or, put the other way around, undergeneralizing and underfitting). If we have a rather smooth decision boundary and some data instances are on the wrong side, do we curve the boundary (less generalizing, more fitting) or do we treat the instances on the wrong side as pure noise in the data (and thus generalize more and fit less)?
Practically every model has a parameter for setting this balance, like the above $\alpha$; a sketch after the list below shows typical examples.
Classification trees: setting the required number of data instances in leaves or a sufficient proportion of the majority class decides when to stop splitting the data, thus "smoothing" the decision boundary. Smaller trees are more general and less (over?)fit.
Logistic regression: we have L1 and L2, just as described above.
SVM: there are a bunch of parameters that set the complexity of the boundary and the misclassification penalty.
k Nearest neighbours: the larger the number of neighbours, the more the model moves towards the mean/majority.
Naive Bayesian classifier is usually implemented without balancing, though one can essentially regularize the probabilities $P(c|x_i)$. We are not going deeper here since you will have a whole course dedicated to it.
Random forest: this method is strange because there is no clear consensus about how and when it can overfit. It is surely a bad idea to have a small forest with large trees, but with small trees it seems that you can't overfit by increasing the number of trees.
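As a rough summary of the list above, here is how these complexity knobs look in scikit-learn (parameter names and values are purely illustrative; Orange exposes equivalent options in its widgets):

```python
# Illustrative only: typical "complexity knobs" in scikit-learn that roughly
# correspond to the balance described above (parameter values are arbitrary).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    # larger leaves -> smaller tree -> smoother boundary
    "tree": DecisionTreeClassifier(min_samples_leaf=20),
    # smaller C -> stronger (L1 or L2) regularization
    "logistic regression": LogisticRegression(penalty="l2", C=0.1),
    # C penalizes misclassification, gamma controls boundary complexity
    "SVM": SVC(C=1.0, gamma="scale"),
    # more neighbours -> prediction closer to the overall mean/majority
    "kNN": KNeighborsClassifier(n_neighbors=15),
    # many small trees rather than a few large ones
    "random forest": RandomForestClassifier(n_estimators=200, max_depth=5),
}

for name, model in models.items():
    print(name, "->", model)
```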
In summary, with power comes responsibility. The more complex the model, the more it can overfit and the more careful you have to be to avoid overfitting it.
-
We are telling you about this book only because we must do it at some point. It is too difficult for this course, but it provides an overview of machine learning methods from a statistical point of view. Chapters 4 and 5 should not be too difficult, and you can read them to better understand linear models and regularization.
You can download the book for free.