Section outline

  • We discussed models that work by assigning a certain number of points to each possible feature/variable, for instance, a certain number of points for being male (or female), a certain number of points for each year of the person's age, and so forth. The total can be expressed as a sum $\beta^\intercal \mathbf{x}$. If the total exceeds 0, we predict the positive class, otherwise the negative one.

    For easier understanding, we can imagine that all variables are numeric. The points for which $\beta^\intercal \mathbf{x} = 0$ lie on a hyperplane, which is called the decision boundary.

    The modelling problem can thus be reimagined as follows: we have a room of red (positive) and blue (negative) points, and the modelling task is to draw a plane that separates them (as well as possible). The plane is defined by the vector $\beta$.

    Models of this kind are linear models.
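
    As a minimal sketch in Python (the weights and the data instance below are made up for illustration, not taken from the lecture), the whole decision rule is just a dot product followed by a comparison with zero:

    ```python
    import numpy as np

    # hypothetical weights ("points" per feature) and one data instance
    beta = np.array([1.5, -0.8, 0.3])
    x = np.array([1.0, 2.0, 4.0])

    score = beta @ x                    # the total number of points, beta^T x
    prediction = "positive" if score > 0 else "negative"
    print(score, prediction)            # 1.1 -> "positive"
    ```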

    One of them is logistic regression, which defines the probability of the class $y$ given data $\mathbf{x}$ as $p(y|\mathbf{x}) = 1 / (1 + e^{-\beta^\intercal \mathbf{x}})$. It uses the logistic function to transform the distance from the plane into a probability. Logistic regression tries to find a plane such that all points are as far away from the boundary as possible, on the correct side. Formally, it maximizes the product of the probabilities that the model assigns to the correct class. This product is called the likelihood, and the process of finding the optimal decision boundary by optimizing the likelihood is called maximum likelihood estimation. You will surely encounter it in other classes, too. More in the paper below.
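
    Here is a sketch of these quantities (the coefficients and the three data instances are invented; a real model would be obtained by searching for the $\beta$ that maximizes the likelihood):

    ```python
    import numpy as np

    def sigmoid(t):
        # the logistic function: distance from the plane -> probability
        return 1.0 / (1.0 + np.exp(-t))

    beta = np.array([0.8, -0.5])               # hypothetical coefficients
    X = np.array([[1.0, 2.0],                  # one row per data instance
                  [3.0, 1.0],
                  [0.5, 4.0]])
    y = np.array([0, 1, 0])                    # true classes

    p = sigmoid(X @ beta)                      # P(y=1 | x) for every instance
    correct = np.where(y == 1, p, 1 - p)       # probability assigned to the correct class
    likelihood = np.prod(correct)
    log_likelihood = np.sum(np.log(correct))
    # maximum likelihood estimation looks for the beta that maximizes these
    print(likelihood, log_likelihood)
    ```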

    Another common linear model is the (linear) support vector machine (SVM), which optimizes a slightly different criterion: it maximizes the distance (the margin) between the plane and the closest points, with some punishment for points lying on the wrong side. We have not spent much time on this since SVMs will become more interesting later.
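
    For completeness, a sketch of this criterion in its usual soft-margin form (the function is only illustrative; $C$ controls the strength of the punishment and the classes are encoded as $\pm 1$, which is the standard convention):

    ```python
    import numpy as np

    def svm_objective(beta, X, y, C=1.0):
        margins = y * (X @ beta)                      # functional margin of each point
        hinge = np.maximum(0.0, 1.0 - margins)        # punishment for wrong/too-close points
        return 0.5 * beta @ beta + C * np.sum(hinge)  # small ||beta|| = wide margin
    ```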

    Our final linear model was the naive Bayesian classifier. It is derived differently than the other two. We want to predict the probability of class $c$ given some attributes $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, that is, $P(c|\mathbf{x})$. By applying Bayes' rule (twice), we discover that $P(c|\mathbf{x}) \propto P(c) \prod_i \frac{P(c|x_i)}{P(c)}$, if we naively (hence the name) assume that the attributes are independent. With some clever manipulation (read about it in the paper about nomograms), we can see that this model can also be expressed with an equation of the same linear form, $\beta^\intercal \mathbf{x}$, as the models above. The only difference is (again) in how the hyperplane (expressed by $\beta$) is fit to the data.
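
    A sketch of that manipulation for two classes (the prior and the per-attribute probabilities below are made up, as if estimated from a training set): each attribute shifts the log odds by its own contribution, so the model is again a sum.

    ```python
    import numpy as np

    p_c = 0.4                                       # prior P(c)
    prior_log_odds = np.log(p_c / (1 - p_c))

    # P(c | x_i) for the observed value of each attribute, estimated one attribute at a time
    p_c_given_xi = np.array([0.6, 0.3, 0.7])

    # each attribute shifts the log odds by log-odds(c | x_i) - log-odds(c)
    contributions = np.log(p_c_given_xi / (1 - p_c_given_xi)) - prior_log_odds
    log_odds = prior_log_odds + contributions.sum()

    p_c_given_x = 1 / (1 + np.exp(-log_odds))       # back to a probability
    print(contributions, p_c_given_x)
    ```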

    The naive Bayesian classifier and logistic regression differ in important aspects you should remember. The differences stem from the fact that the naive Bayesian classifier is univariate (it considers a single variable at a time, independently of the others), while logistic regression is multivariate.

    The naive Bayesian classifier does not take correlations into account because it assumes the attributes are independent. Since it considers one variable at a time, $\beta$ contains the importance of each attribute separately. We will use it when we want to know the importance of individual attributes.

    Logistic regression observes all variables at once and takes correlations into account. If some variables are correlated, their importance will be spread among them. With proper regularization (we'll talk about this later), LR can be used to find a subset of non-redundant (non-correlated, "non-overlapping") variables sufficient for making predictions.
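
    A small synthetic illustration of this (not from the lecture; it uses scikit-learn, and the exact coefficients will vary with the random data): two identical copies of an informative variable share its weight, and L1 regularization tends to keep only one of them.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    noise = rng.normal(size=200)
    X = np.column_stack([x1, x1, noise])            # the second column duplicates the first
    y = (x1 + 0.1 * rng.normal(size=200) > 0).astype(int)

    lr_l2 = LogisticRegression().fit(X, y)
    lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    print(lr_l2.coef_)   # the importance of x1 is spread across the two correlated columns
    print(lr_l1.coef_)   # L1 tends to zero out one of the redundant columns
    ```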

    Probabilities returned by the naive Bayesian classifier are not well calibrated, because the method is univariate and may count the same piece of evidence multiple times. Logistic regression is usually well calibrated (the logistic function is sometimes even used to calibrate other classifiers).
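
    A tiny numeric illustration of the double counting (the numbers are invented): a single piece of evidence that raises the probability to 0.8 gets counted again for every correlated attribute that repeats it, pushing the prediction towards the extremes.

    ```python
    import numpy as np

    def prob(log_odds):
        return 1 / (1 + np.exp(-log_odds))

    prior_log_odds = np.log(0.5 / 0.5)          # uninformative prior
    evidence = np.log(0.8 / 0.2)                # one attribute with P(c|x_i) = 0.8

    print(prob(prior_log_odds + evidence))      # ~0.80: the evidence counted once
    print(prob(prior_log_odds + 3 * evidence))  # ~0.98: three correlated attributes repeat it
    ```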

    Being univariate and simpler, the naive Bayesian classifier needs less data than logistic regression. It can also handle missing data: if a value is unknown, its contribution is simply zero. Logistic regression cannot do this.
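
    In the log-odds formulation sketched above this is trivial to show (again with made-up contributions): an unknown value adds nothing to the sum.

    ```python
    import numpy as np

    prior_log_odds = np.log(0.4 / 0.6)
    contributions = np.array([0.7, np.nan, -0.3])        # the second attribute's value is unknown
    log_odds = prior_log_odds + np.nansum(contributions) # nan is treated as a zero contribution
    print(1 / (1 + np.exp(-log_odds)))
    ```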

    Finally, we observed the naive Bayesian classifier and logistic regression in a nomogram, which shows the regression coefficients assigned to individual values. The nomogram can be used for making predictions or for exploring the model. To make a prediction, we drag the point for each variable to its corresponding value, and the axes at the bottom convert the sum into a prediction. In the case of the Bayesian classifier, we can also leave a point at the center if the value is unknown. In terms of exploring the model, the lengths of the lines in the nomogram tell us how many points a data instance can gain or lose based on each variable. The Bayesian classifier can also show us the impact of each individual value on the decision.
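
    As a sketch of the "length of the line" idea (the per-value points are hypothetical): a variable's line spans the range of points its values can contribute, so a longer line means the variable can change the total more.

    ```python
    points = {                                   # hypothetical points per value
        "gender": {"male": 0.8, "female": -0.8},
        "age":    {"<40": -0.3, "40-60": 0.1, ">60": 0.9},
    }
    for variable, per_value in points.items():
        length = max(per_value.values()) - min(per_value.values())
        print(variable, length)                  # longer line = larger possible effect on the score
    ```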

    These as well as other differences (e.g. the nomogram for logistic regression does not tell us the importance of individual features) come from the general differences between the two techniques.