We constructed a model for deciding whether somebody will survive the sinking of the Titanic by assigning a certain number of points to each person's travelling class, age and gender. We can encode the attributes into dummy variables: in the simplest, though not optimal, case we describe each person with $\mathbf{x}=(x_\textrm{first}, x_\textrm{second}, x_\textrm{third}, x_\textrm{crew}, x_\textrm{child}, x_\textrm{adult}, x_\textrm{male}, x_\textrm{female})$, where each variable can have value 0 or 1; a girl travelling in the second class would then be $\mathbf{x} = (0, 1, 0, 0, 1, 0, 0, 1)$. If we put the number of points for each property into a vector $\mathbf{w}$ (say 8 points for the first class, 3 for the second, -2 for the third, -5 for the crew and so forth), the total number of points can be computed as $\mathbf{w}^\intercal \mathbf{x}$. If the total exceeds 0, the person survives; otherwise (s)he dies.
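The points-based model above can be sketched in a few lines of Python. The weights below are illustrative, not fitted to the actual Titanic data:

```python
# A sketch of the points-based survival model: the weight values are
# made up for illustration, not estimated from the real Titanic data.
w = {"first": 8, "second": 3, "third": -2, "crew": -5,
     "child": 4, "adult": 0, "male": -6, "female": 6}

def survives(features):
    """Dot product w^T x over one-hot features; survive if positive."""
    return sum(w[f] for f in features) > 0

# A girl travelling in the second class: x = (0, 1, 0, 0, 1, 0, 0, 1)
print(survives({"second", "child", "female"}))  # True: 3 + 4 + 6 = 13 > 0
```

Representing the one-hot vector as a set of active features keeps the dot product a simple sum over the features that equal 1.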

The points for which $\mathbf{w}^\intercal \mathbf{x} = 0$ lie on a hyperplane, which is called the *decision boundary*.

For easier understanding, we can also imagine class, age and gender as continuous variables (*class*: possibly, *age*: obviously, *gender*: absolutely - in the US). Every passenger then represents a point in three-dimensional space, and the hyperplane literally cuts through this space, separating the survivors from the victims.

The modelling problem can thus be reimagined as follows: we have a room full of red (surviving) and blue (dying) points, and the modelling task is to draw a plane that separates them (as well as possible). The plane is defined by the vector $\mathbf{w}$.

Models of this kind are *linear models*.

One of them is logistic regression, which defines the probability of the class $y$ given the data $\mathbf{x}$ as $p(y|\mathbf{x}) = 1 / (1 + e^{-\mathbf{w}^\intercal \mathbf{x}})$. It uses the logistic function to transform the distance from the plane into a probability. Logistic regression tries to find a plane such that all points are as far away from the boundary (on the correct side) as possible. Formally, it maximizes the product of the probabilities that the model assigns to the correct class. This product is called the *likelihood*, and the process of finding the optimal decision boundary by optimizing the likelihood is called *maximum likelihood estimation*. You will surely encounter it in other classes, too. More in the paper below.
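Maximum likelihood estimation for logistic regression has no closed-form solution, so the weights are found iteratively. A minimal sketch, with a tiny made-up two-feature dataset (not the Titanic data) and plain stochastic gradient ascent on the log-likelihood:

```python
import math

# Toy data: each row is (bias, x1, x2); the class happens to follow x1.
X = [(1.0, 1.0, 0.0), (1.0, 1.0, 1.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
y = [1, 1, 0, 0]

def sigmoid(z):
    """The logistic function: maps a distance from the plane to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w = [0.0, 0.0, 0.0]
for _ in range(2000):                 # gradient ascent on the log-likelihood
    for i, x in enumerate(X):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j in range(3):            # gradient of log-likelihood: (y - p) x
            w[j] += 0.1 * (y[i] - p) * x[j]

# The fitted plane classifies the training points correctly:
print([round(sigmoid(sum(wj * xj for wj, xj in zip(w, x)))) for x in X])
# [1, 1, 0, 0]
```

Because the toy data are linearly separable, the weights keep growing with more iterations; in practice this is tamed by regularization, mentioned further below.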

Another common linear model is the (linear) support vector machine (SVM), which optimizes a slightly different criterion: it maximizes the distance between the plane and the closest points, with some punishment for points lying on the wrong side. We have not spent much time on this, since SVMs will become more interesting later.
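A sketch of the soft-margin criterion the SVM optimizes: minimize $\|\mathbf{w}\|^2/2 + C \sum_i \max(0, 1 - y_i \mathbf{w}^\intercal \mathbf{x}_i)$ by subgradient descent. The data, learning rate and $C$ below are illustrative:

```python
# Toy data; the SVM convention is labels in {-1, +1}, not {0, 1}.
X = [(1.0, 2.0), (2.0, 3.0), (-1.0, -1.5), (-2.0, -1.0)]
y = [1, 1, -1, -1]
w, b, C, lr = [0.0, 0.0], 0.0, 1.0, 0.01

for _ in range(1000):
    for xi, yi in zip(X, y):
        margin = yi * (w[0] * xi[0] + w[1] * xi[1] + b)
        grad = list(w)                # regularizer ||w||^2/2 shrinks w,
        db = 0.0                      # which maximizes the margin ...
        if margin < 1:                # ... while the hinge loss punishes
            grad = [g - C * yi * xij  # points inside or beyond the margin
                    for g, xij in zip(grad, xi)]
            db = -C * yi
        w = [wj - lr * g for wj, g in zip(w, grad)]
        b -= lr * db

print([1 if w[0] * xi[0] + w[1] * xi[1] + b > 0 else -1 for xi in X])
# [1, 1, -1, -1]
```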

Our final linear model is the naive Bayesian classifier. It is derived differently from the other two. We want to predict the probability of class $c$ given some attributes $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, that is, $P(c|\mathbf{x})$. By applying Bayes' rule (twice), we discover that $P(c|\mathbf{x}) \propto P(c) \prod_i \frac{P(c|x_i)}{P(c)}$, if we *naively* (hence the name) assume that the attributes are independent. With some clever manipulation (read about it in the paper about nomograms), we can see that this model can also be expressed with an equation of the same form as logistic regression. The only difference is (again) in how the hyperplane (expressed by $\mathbf{w}$) is fit to the data.
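The naive Bayesian computation amounts to counting. A minimal sketch in the equivalent form $P(c|\mathbf{x}) \propto P(c) \prod_i P(x_i|c)$, on a made-up toy table (not the real Titanic counts):

```python
# Toy training data: (attributes, class), invented for illustration.
data = [
    ({"class": "first", "sex": "female"}, "yes"),
    ({"class": "first", "sex": "male"},   "yes"),
    ({"class": "third", "sex": "male"},   "no"),
    ({"class": "third", "sex": "male"},   "no"),
    ({"class": "third", "sex": "female"}, "yes"),
    ({"class": "first", "sex": "male"},   "no"),
]

def predict(x):
    """Pick the class maximizing P(c) * prod_i P(x_i|c), from counts."""
    scores = {}
    for c in ("yes", "no"):
        rows = [d for d, cls in data if cls == c]
        score = len(rows) / len(data)                    # prior P(c)
        for attr, val in x.items():                      # naive product
            score *= sum(d[attr] == val for d in rows) / len(rows)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict({"class": "first", "sex": "female"}))  # yes
```

A real implementation would smooth the conditional probabilities (e.g. Laplace smoothing) so that a single zero count cannot veto a class entirely.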

The naive Bayesian classifier and logistic regression differ in important aspects that you should remember. The differences stem from the fact that the naive Bayesian classifier is univariate (it considers a single variable at a time, independently of the others), while logistic regression is multivariate.

The naive Bayesian classifier does not take correlations into account, because it assumes the attributes are independent. Since it considers one variable at a time, $\mathbf{w}$ contains the importance of each attribute separately. We will use it when we want to know the importance of each attribute.

Logistic regression observes all variables at once and takes correlations into account. If some variables are correlated, their importance will be spread among them. With proper regularization (we'll talk about this later), logistic regression can be used to find a subset of non-redundant (non-correlated, "non-overlapping") variables sufficient for making predictions.

Probabilities returned by the naive Bayesian classifier are not well calibrated, because the method is univariate and considers the same piece of evidence multiple times. Logistic regression is usually well calibrated (the logistic function is in fact sometimes used to calibrate other classifiers).

Being univariate and simpler, the naive Bayesian classifier needs less data than logistic regression. It can also handle missing data: if a value is unknown, its contribution is zero. Logistic regression cannot do this.
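This missing-value behaviour is easiest to see in log space, where the naive Bayesian score is a sum of per-attribute contributions, so an unknown attribute simply adds nothing. A sketch with made-up log contributions (the numbers are illustrative, not estimated from data):

```python
# Per-value contributions to the log-odds, invented for illustration.
log_contrib = {
    ("class", "first"): 0.9, ("class", "third"): -0.7,
    ("sex", "female"): 1.2,  ("sex", "male"): -0.5,
}
log_prior = 0.1  # log-odds of the prior, also made up

def log_odds(x):
    # Attributes absent from x are simply skipped: contribution zero.
    return log_prior + sum(log_contrib[(a, v)] for a, v in x.items())

print(log_odds({"class": "first"}))                  # sex unknown: 0.1 + 0.9
print(log_odds({"class": "first", "sex": "male"}))   # 0.1 + 0.9 - 0.5
```

Logistic regression has no such decomposition per attribute, so a missing value must be imputed before the model can be applied at all.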

Finally, we observed the naive Bayesian classifier and logistic regression in a nomogram, which shows the regression coefficients assigned to individual values. The nomogram can be used for making predictions or for exploring the model. To make a prediction, we drag the point for each variable to its corresponding value, and the axes at the bottom convert the sum into a prediction. In the case of the Bayesian classifier, we can also leave a point at the center if the value is unknown. In terms of exploring the model, the lengths of the lines in the nomogram tell us how many points a data instance can gain or lose based on each variable. The Bayesian classifier can also show us the impact of a value on the decision; for instance, in the Titanic data, being a child increases the probability of survival, while being an adult does not affect it.

These as well as other differences (e.g. the nomogram for logistic regression does not tell us the importance of individual features) come from the general differences between the two techniques.