Section outline

  • In the lecture, we greatly improved the algorithms we have seen so far.

    • Support vector machines (SVM) become non-linear. The problem with linear support vector machines is that they are linear, that is, the boundary they draw is always a hyperplane. We observed that we can use a similar trick as in linear regression: add squares of variables and their products. The trick (which actually has a name: the kernel trick) is to expand the data into a higher-dimensional, perhaps even infinite-dimensional space, without explicitly transforming it. It turns out that SVM (and other linear methods) can be reformulated in such a way that they use the data instances (e.g. $x_i$, $x_j$) only to compute scalar products - in this context, they are usually written as $\left<x_i, x_j\right>$. If the data is expanded using a certain function $\phi(x)$ that, for instance, adds products, we need to compute $\left<\phi(x_i), \phi(x_j)\right>$ instead. It is often possible to compute this product without explicitly computing $\phi$: instead, we compute some (sufficiently simple) function $K(x_i, x_j)$ of $x_i$ and $x_j$ whose value equals $\left<\phi(x_i), \phi(x_j)\right>$. Such a function $K$ is called a kernel.

      A kernel function can be interpreted as a function that computes the similarity between two data instances. If $\phi(x)=x$, the kernel is just the ordinary scalar product which, if the data rows are normalized, tells whether two vectors point in the same direction (1), in opposite directions (-1) or anywhere in between.

      One can thus go from the opposite end: define some similarity measure $K$ and use it as a kernel without explicitly defining the mapping $\phi$. The magic of SVM (and of other methods that can use kernels, which are therefore called kernel methods) is that they implicitly "invent" a transformation into a (usually infinite-dimensional) space in which the distances between objects are as prescribed by the kernel, and draw a hyperplane in that space.

      Abstract talk aside, an SVM with different kernels can split the data not with ordinary hyperplanes but with more complex curves. The complexity of the curve is determined by the kernel type and by the arguments given to the algorithm, like the degree and coefficients of the kernel, and the penalty for misclassification. A small numeric check of the kernel-trick identity follows below.
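      The sketch below is a minimal check of the identity $K(x_i, x_j) = \left<\phi(x_i), \phi(x_j)\right>$, assuming (as an arbitrary choice for illustration) the degree-2 polynomial kernel $K(x_i, x_j) = \left<x_i, x_j\right>^2$ and plain NumPy. It expands the data explicitly with $\phi$ once, only to show that the kernel gives the same scalar product without ever leaving the original space.

      ```python
      import numpy as np

      # phi(x): explicit expansion for a degree-2 polynomial kernel (no constant
      # term) -- all pairwise products x_i * x_j, squares included.
      def phi(x):
          return np.outer(x, x).ravel()

      # K(x, z) = <x, z>^2, computed entirely in the original space.
      def poly_kernel(x, z):
          return np.dot(x, z) ** 2

      rng = np.random.default_rng(0)
      x, z = rng.normal(size=5), rng.normal(size=5)

      print(np.dot(phi(x), phi(z)))  # scalar product in the 25-dimensional expanded space
      print(poly_kernel(x, z))       # the same value, without ever computing phi
      ```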

    • Classification trees become forests. The basic limitation of trees (in particular compared to linear models) is that they cannot sum up the evidence. When the decision requires taking into account hundreds or even thousands of small factors, trees won't do. Consider even ten binary variables: if each decision needs to take into account all of them, then every path from the root to a leaf will have 10 steps, hence the tree will have $2^{10}=1024$ leaves, which will require 1024 data instances just to have a single one in each leaf (and we know that's far from enough). With 20 binary variables we have $2^{20}\approx 1$ million leaves, and with 30 we have $2^{30}\approx 1$ billion. With variables that take multiple values instead of two, or with numeric features ... you get the picture.

      Instead of a single tree, we construct many of them by somehow randomizing their induction. We can use different samples of the data or of the variables, or change the algorithm to pick one of the best-rated attributes instead of the best one. We sometimes build entire trees, and sometimes severely limit them, say to just a few levels. Due to the algorithm's instability, this gives us very different trees. We create thousands of them and call the result a random forest. We classify by voting, and we can also get better probability estimates by counting the trees that vote for and against a particular decision. A rough sketch of this procedure follows below.
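      The following is only a bare-bones sketch of the idea, not a proper implementation (library versions such as scikit-learn's RandomForestClassifier or Orange's Random Forest widget do this for real). It assumes a binary class, uses scikit-learn's bundled breast cancer data and its DecisionTreeClassifier, and picks arbitrary settings (1000 trees of depth 3, $\sqrt{m}$ variables per tree): each tree sees a bootstrap sample of the rows and a random subset of the variables, and the forest predicts by voting.

      ```python
      import numpy as np
      from sklearn.datasets import load_breast_cancer
      from sklearn.tree import DecisionTreeClassifier

      X, y = load_breast_cancer(return_X_y=True)   # binary class: 0 or 1
      rng = np.random.default_rng(0)
      n, m = X.shape

      trees, feats = [], []
      for _ in range(1000):
          rows = rng.integers(0, n, n)                               # bootstrap sample of rows
          cols = rng.choice(m, size=int(np.sqrt(m)), replace=False)  # random subset of variables
          trees.append(DecisionTreeClassifier(max_depth=3).fit(X[rows][:, cols], y[rows]))
          feats.append(cols)

      # Vote: each tree predicts 0 or 1; the mean is the fraction of trees voting
      # for class 1, which doubles as a probability estimate.
      votes = np.array([t.predict(X[:, c]) for t, c in zip(trees, feats)])
      p_hat = votes.mean(axis=0)
      print("training accuracy of the vote:", ((p_hat > 0.5).astype(int) == y).mean())
      ```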

    • K-nearest neighbours. With a bit of a stretch, we related this to the naive Bayesian classifier. Its problem is its naivety: it computes $P(x_1, x_2, \ldots, x_n|c)$ as $\prod_i P(x_i|c)$ because the latter can be easily estimated from the data and the former cannot, since the data probably does not contain any (or at least not enough) instances with values exactly equal to $(x_1, x_2, \ldots, x_n)$. The idea of k-nearest neighbours is to find the $k$ instances that are most similar to $(x_1, x_2, \ldots, x_n)$ and make the prediction, or estimate the probabilities, based on the classes of these $k$ instances.

      Unlike all the algorithms we have seen so far, this one does not construct a model but just stores the data. This kind of learning is called lazy learning. A minimal sketch follows below.
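      A minimal sketch of the idea, assuming plain NumPy, Euclidean distance as the similarity measure and a hypothetical toy data set: "training" only stores the data, and all the work happens when a prediction is requested.

      ```python
      import numpy as np

      # "Training" just stores the data; all the work happens at prediction time.
      def knn_predict_proba(X_train, y_train, x_new, k=5):
          dists = np.linalg.norm(X_train - x_new, axis=1)  # smaller distance = more similar
          nearest = np.argsort(dists)[:k]                  # indices of the k most similar instances
          return np.bincount(y_train[nearest], minlength=y_train.max() + 1) / k

      # Toy usage with two well-separated classes in 2D.
      X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
      y = np.array([0, 0, 0, 1, 1, 1])
      print(knn_predict_proba(X, y, np.array([4.5, 5.0]), k=3))  # -> [0. 1.]
      ```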

    Which algorithm to choose?

    The first question is whether you want to be able to explain the decision.

    • If your primary aim is to construct a model and observe it to find some patterns in your data, you can use linear models (with nomograms) or trees.

    • If you need the model to help you make decisions, you probably want to know what the model's predictions are based on - you want the model to explain the prediction. In this case, use any of the above or maybe random forest (checking which trees voted for some class and which variables they used may provide some insight) or k-nearest neighbours (the explanation consists of the algorithm showing you the most similar data instances).

    • If you aim at high accuracy, compute the accuracy of all models with some kind of cross validation (as in the Test & Score widget) and pick the best model. Linear models (naive Bayes, logistic regression) should work best on simple data, kNN may sometimes surprise you, SVM is good if its parameters are fitted properly, and random forests are the method to bet on when there are many variables that each give a small piece of information. An additional advantage of random forests is that they often don't need a lot of fitting - just construct a lot of (say, one thousand) trees. Such a comparison is sketched below.
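      The sketch below does roughly what the Test & Score widget does, here written with scikit-learn; the data set, the selection of models and their default parameters are arbitrary choices for illustration.

      ```python
      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import GaussianNB
      from sklearn.linear_model import LogisticRegression
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.svm import SVC
      from sklearn.ensemble import RandomForestClassifier

      X, y = load_breast_cancer(return_X_y=True)
      models = {
          "naive Bayes": GaussianNB(),
          "logistic regression": LogisticRegression(max_iter=10000),
          "kNN": KNeighborsClassifier(),
          "SVM (RBF kernel)": SVC(),
          "random forest": RandomForestClassifier(n_estimators=1000),
      }
      for name, model in models.items():
          scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross validation
          print(f"{name:20s} {scores.mean():.3f}")
      ```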