## Topic outline

• ### .

On this page, we will publish short reminders about topics covered. Instead of endless lists of scientific papers, we will give you short and friendly articles. We won't shy away from directing you to Wikipedia, too.

• ### Visualizations (and Getting to Know Orange)

The purpose of visualization could be to:

• provide nice illustrations for newspaper articles;
• help us discover patterns in the data.

In data mining, we use them for the latter (except when trying to impress our bosses). We have not evolved as number crunching animals, so our brains are specialized for visual patterns. Given a sequence of numbers, we are not able to find patterns, like trends and correlations. Seeing a picture, we are. So the main purpose of insight-giving visualizations are to present the data in such a form that our brains can spot things that we would not be able to spot if the data was presented as just numbers (or whatever human-unfriendly form the data has).

Here is a brief summary of what we covered.

• Box Plot: shows basic statistics, good for observing a single variable. We can split it into multiple box plots by a categorical variable. In this case, we can compute chi squares or ANOVAs and order the variables by how well they variable splits the categories.

• Distribution: shows distributions as histograms. We can again split them by a categorical variable. We emphasized the problem of choosing the number of bins, which can strongly affect the visualization. Curves also have to be considered carefully since they are either result of some smoothing, which has some parameters that are hidden or difficult to set and interpret, or they result of fitting of parameters to a shape that we have to decide -- and often have no good argument for a particular shape. Bins are essentially better because they do not hide anything -- unless the criteria for deciding the number of bins or boundaries are manipulated.

• Scatter Plot: plots the relationship between two numeric variables. To show other quantities in the same plot, we can use colors, sizes, shapes or other fancy properties, such as angles. Out of these three, colors are the only one that really works. It is easy to see that a certain region has a prevalence of certain colors. (Sometimes this is too easy to see. We are very able to spot patterns where there are none. We'll see some examples in due time.)

A note about using colors for showing numeric quantities. It is a very good idea to use discrete scale with not too many colors. An average human can distinguish roughly five different colors, except for women who can distinguish about a dozen thousands. If the scale of colors is continuous, it is impossible to compare a color in a legend with a color of the spot on the graph. While continuous scales appear to be more exact than discrete scales with, say, eight different colors, the latter are probably better.

You have learned about other traps (such as using sizes) in the other course.

• Mosaic Display: splits the data by 1-4 variables into ever smaller subgroups, such that the size of the area corresponds to the size of the group. The area can be split further to show the target variables distribution inside the group. This visualization is useful for finding relations between categorical features in a similar fashion as the scatter plot does for numeric.

• Sieve Diagram: essentially a visualization of chi-square, it plots actual frequencies of two variables against their expected frequencies. Unlike in the Mosaic display, the area is split independently along each axis, so sizes correspond to the expected number of instances assuming that the variables are independent. Sieve then shows - using a grid and colors - the violations of independence, that is, combinations of values that are more common or rare than expected.

• ### Introduction to predictive modelling

In statistics, you would usually start with a matrix whose rows represent (independent) data instances and columns represent variables. We would have a matrix $\mathbf{X}$ of independent variables and a vector $\mathbf{y}$ with dependent variable (target, outcome), and the goal would be to find a function $f$ such that $\mathbf{\hat{y}} = f(\mathbf{X})$ would match $\mathbf{y}$ as closely as possible. The shape of function $f$ would usually be fixed (e.g. $f$ would be a linear function) and we would aim to minimize $(\mathbf{\hat{y}} - \mathbf{y})^2$ or something similar.

Since the function is given, the usual term is that we are fitting the model to the data.

Limitations of the usual statistical approaches are that variables must be numeric, and functions have to be differentiable or sufficiently humble in some other respect.

Pioneers of AI considered more general problems in which data instances were not necessarily described by attributes but rather with relations, and the goal would be to learn or to find a theory that would compute some particular relation (e.g. trains going west or east) based on other relations (e.g. trains' compositions).

These goals were interesting but were not really achieved as the focus moved to an easier subject of attribute-based learning. Here, data instances are described by their attributes (akin to independent variables) and separated into classes (like dependent variables). Unlike in statistics, attributes can be numeric or categorical and the function $f$ can also be of any form, such as, for instance, a list of rules. The terminology used to be that we are learning the model, though as the field became more mathematical, it also adopted the term fitting.

We have used classification trees as a simple and handy example of machine learning model. We discovered its basic idea (split further and further until you arrive at pure groups or run out of data) and the spent a lot of time with computation of entropy, information gain and information gain ratio. More on that in the links below.

• ### Measuring performance of models

We saw a model that was able to predict randomly assigned data labels. Since this is clearly impossible, something must had been wrong in our procedure.

We discovered that the tree essentially memorized the data: it split the space of data instances (you can imaging this space as a scatter plot) so that each part (a leaf of the tree) contained just a few data instances, all belonging to the same class. When making predictions, the tree just checked the region to which the data instance belongs and recalled the correct class.

To avoid such situations, we should never ever estimate the quality of the tree by measuring its performance of the data that was used for learning the tree (also called training data), but always retain separate data for testing. We discussed different sampling techniques, the most known of which is cross validation.

Next we focused on measuring the quality of models. We can use them for (at least) two purposes to make predictions, or to get some insight into the data.

For the latter, a good model is a model that gives us some useful insight. We should treat models as hypothesis, as ideas, and test them against our background knowledge (does it make sense? does the tree use attributes that are related to the problem? ...)

For the former, we have different performance scores. Classification accuracy is the simplest, but it ignores the types of mistakes; classifying a sick person as healthy is (usually, at least for most observers :) worse than vice versa. Hence we have derived a number of scores that measure probabilities of various types of mistakes, such as the probability that a sick person will be classified as sick (sensitivity or recall), and the probability that a person classified as sick is actually sick (precision). There are dozens of such combinations. It is less important to know their names than to be aware that the score we are using should correspond to our goals.

Many of the above measures appear in pairs: by changing the decision threshold, we can improve one at a cost of another. To visualize this, we use curves, such as ROC curve.

Everybody should draw a ROC curve manually at least once in his life. We've done so in the lecture. This led us to think about other interesting properties of the curve and especially the interpretation of the area under it. All that we've done and said can be found in the paper we cite below An introduction to ROC analysis. To test our understanding, we solved a problem faced by Sara, a hamster veterinarian.

The most important message here was that when making predictions, we have to set a decision threshold and the threshold usually balances between two opposing quantities (sensitivity-specificity, precision-recall), and can also take costs into consideration. In practice we'd often construct other types of curves to show the possible operating points.

Finally, we mentioned calibration curves that show us the calibration of model's probabilities.

• ### Linear models for classification

We constructed a model for deciding whether somebody will survive the sinking of Titanic by assigning a certain number of points to each person's travelling class, age and gender. We can the attributes into dummy variables - in the simplest, but not the optimal case, we describe each person with $\mathbf{x}=(x_\textrm{first}, x_\textrm{second}, x_\textrm{third}, x_\textrm{crew}, x_\textrm{child}, x_\textrm{adult}, x_\textrm{male}, x_\textrm{female})$, where a variable can have value 0 or 1; a girl travelling in the second class would then be $\textbf{x} = (0, 1, 0, 0, 1, 0, 0, 1)$. If we put the number of points for each property into a vector $\mathbf{w}$ (say 8 points for the first class, 3 for second, -2 for third, -5 for crew and so forth), the total number of points can be computed as $\mathbf{w}^\intercal \mathbf{x}$. If the total exceeds 0, the persons lives, if it's negative (s)he dies.

The points for which $\mathbf{w}^\intercal x = 0$ lie on a (hyperplane), which is called a decision boundary.

For easier understanding, we can also imagine the class, age and gender as continouos variables (class: possibly, age: obviously, gender: absolutely - in the US). Every passenger represents a point in two-dimensional space and the plane literally cuts through this space, separating the survivors from the victims.

The modelling problem can thus be reimagined as follows: we have a room of red (surviving) and blue (dying) points, and the modelling task is to draw a plane that separates them (as well as possible). The plane is defined by the vector $\textbf{w}$.

Models of this kind are linear models.

One of them is logistic regression, which defines the probability of the class $y$ given data $\textbf{x}$ as $p(y|x) = 1 / (1 + e^{-\textbf{w}^\intercal \textbf{x}})$. It uses the logistic function to transform the distance from the plane into probabilities. Logistic regression tries to find such a plane that all points from one class are as far away from the boundary (in the correct direction) as possible. Formally, it maximizes the product of probabilites that the model will assign to the correct class. Such product is called likelihood and the process of finding the optimal decision boundary by optimizing the likelihood is called maximum likelihood estimation. You will surely encounter it in other classes, too. More in the paper below.

Another common linear model is (linear) support vector machine (SVM), which optimizes a slightly different criteria: it maximizes the distances between the plane and the closest points, with some punishment for points lying on the wrong side. We have not spent much time on this since SVMs will become more interesting later.

Our final linear model is the Naive Bayesian classifier. It is derived differently than the other two. We want to predict the probability of class $c$ given some attributes $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, that is $P(c|\mathbf{x})$. By applying Bayesian rule (twice), we discover that $P(c|\mathbf{x}) \sim P(c) \prod_i P(c|x_i)$, if we naively (thence the name) assume that the attributes are independent. With some clever manipulation (read about it in the paper about nomograms), we can see that this model can also be expressed with an equation of the same form as linear regression. The only difference is (again) in how the hyperplane (expressed by $\mathbf{w}$) is fit to the data.

Naive Bayesian classifier and logistic regression differ in important aspects you should remember. Difference stem from the fact that Naive Bayesian classifier is univariate (it considers a single variable at a time, independent of others), while logistic regression is multivariate.

Naive Bayesian classifier does not take correlations into account, because it assumes the attributes are independent. Considering one variable at a time, $\mathbf{w}$ contains the importance of each attribute separately. We will use it when we want to know the importance of each attributes.

Logistic regression observes all variables at once and takes the correlation into account. If some variables are correlated, their importance will be spread among them. With proper regularization (we'll talk about this later), LR can be used to find a subset of non-redundant (non-correlated, "non-overlapping") variables sufficient for making predictions.

Probabilities returned by Naive Bayesian classifier are not well-calibrated, because the method is univariate and considers the same piece of evidence multiple times. Logistic regression is usually well-calibrated (logistic function is actually used for calibrating other classifiers sometimes).

Being univariate and simpler Naive Bayesian classifier needs less data than logistic regression. It can also handling missing data: if a value is unknown, its contribution is zero. Logistic regression cannot do this.

Finally, we observed naive Bayesian classifier and logistic regression in a nomogram, which shows regression coefficients assigned to individual values. The nomogram can be used for making predictions or exploring the model. To make a prediction, we drag the points for each variable to its corresponding value and the axes at the bottom convert the sum into a prediction. In case of Bayesian classifier, we can also leave a point at the center if the value is unknown. In terms of exploring the model, the lenghts of lines in the nomogram tell us how many points a data instance can get or lose based on each variable. Bayesian classifier can also show us the impact of the value on the decision; for instance, in Titanic data, being child increases the probability of survival, while being an adult does not affect it.

These as well as other differences (e.g. the nomogram for logistic regression does not tell us importance of individual features) come from the general differences between the two techniques.

• ### Other types of classifiers

In the lecture, we gretaly improved the algorithms we have seen so far.

• Support vector machines (SVM) become non-linear. The problem with linear support vector machines is that they are linear, that is, the boundary they draw is always a hyperplane. We observed that we can use a similar trick as in linear regression: add squares of variables and their products. The trick (which actually has a name: kernel trick) is how to expand the data into higher-dimensional, perhaps even infinite-dimensional space, without explicitly transforming it. It turns out the SVM (and other linear methods) can be reformulated in such way that they use the data instances (e.g. $x_i$, $x_j$) only to compute scalar products - in this context, they are usually written as $\left<x_i, x_j\right>$. If data is expanded using a certain function $\phi(x)$ that, for instance, adds products, we need to compute $\left<\phi(x_i), \phi(x_j)\right>$ instead. It is often possible to compute this product without explicitly computing $\phi$, but instead compute another function of $x_i$ and $x_j$, that is, we compute some (sufficiently simple) $K(x_i, x_j)$ whose value is equal to $\left<x_i, x_j\right>$. Such function $K$ is called a kernel.

Kernel function can be interpreted as a function that computes the similarity between the two data instances. If $\phi(x)=x$, the kernel is just an ordinary scalar product that, if data rows are normalized, computes whether two vector point into the same direction (1), in the opposite (-1) or anywhere in between.

One can thus go from the opposite end: define some similarity measure $K$ and use it as a kernel without explicitly define the mapping $\phi$. The magic of SVM (and other methods that can use kernels, and are thus called kernel methods) is that they will implicitly "invent" a transformation into a (usually infinite-dimensional) space, in which the distances between objects are such as prescibed by the kernel, and draw a hyperplane in this space.

Abstract talking aside, SVM with different kernels can split the data not by ordinary hyperplanes, but with more complex curves. The complexity of the curve is decided by the kernel type and by the arguments given to the algorithm, like the degree and coefficients, and the penalty for misclassifications.

• Classification trees become forests. The basic limitation of trees (in particular compared to linear models) is that they cannot sum up the evidence. When the decision requires taking into account hundreds or even thousands of small factors, trees won't do. Consider even ten binary variables: if each decision needs to take into account all of them, then every path from the root to a leave will have 10 steps, hence the tree will have 210=1024 leaves, which will require 1024 data instances just to have a single one in each leave (and we know that's far from enough). With 20 binary variables, we have 220≈1million leaves and with 30 we have 230≈1billion. With multiple values instead of binary, or with numeric features ... you get the picture.

Instead of a single tree, we construct many of them by somehow randomizing their induction. We can use different sample of the data or of variables, or change the algorithm to pick one of the best-rated attributes instead of the best one. We sometimes build entire trees, and sometimes severely limit them, like allowing just a few levels. Due to the algorithm's instability, we'll have very different trees. We create thousands of them and call it a random forest. We classify by voting, and we can also make better probability estimates by counting the number of trees voting in favour and against a particular decision.

• K-nearest neighbours. With a bit of a stretch, we related this to the naive Bayesian classifier. It's problem is naivety: it computes $P(x_1, x_2, \ldots, x_n|c)$ as $\prod_i P(x_i|c)$ because the latter can be easily estimated from the data and the former can't since the data probably does not contain any (or, at least, not enough) instances with values exactly equal to $(x_1, x_2, \ldots, x_n)$. The idea of k-nearest neighbours is to find the $k$ instances that are the most similar to $(x_1, x_2, \ldots, x_n)$. We make the prediction or estimate probabilities based on the classes of these $k$ instances.

Unlike all algorithms we have seen so far, this one does not construct a model but just stores the data. This kind of learning is called lazy learning.

Which algorithm to choose?

The first question is whether you want to be able to explain the decision.

• If your primary aim is to construct a model and observe it to find some patterns in your data, you can use linear models (with nomograms) or trees.

• If you need the model to help you make decisions, you probably want to know what the model's predictions are based on - you want the model to explain the prediction. In this case, use any of the above or maybe random forest (checking which trees voted for some class and which variables they used may provide some insight) or k-nearest neighbours (the explanation consists of the algorithm showing you the most similar data instances).

• If you aim at high accuracy, compute the accuracy of all models with some kind of cross validation (as in Test & Score widget), and pick the best model. Linear models (Bayes, Logistic regression) should work best for some simple data, kNNs may sometimes surprise, SVMs are good if its parameters are fit properly, and random forests are the method to bet on when there are many variables that give small pieces of information. An additional advantage of random forests is that they often don't need a lot of fitting - just construct a lot of (like one thousand) trees.

• ### Regularization

Linear regression can be made non-linear by adding columns with products of variables. This increases it capacity to fit the data, but at the same time its capacity to overfit. We have used a widget Polynomial regression in Orange's add-on Educational, where then curve through the points we draw in Paint goes through all the data at a cost of wildly steering up and down; we also observed the coefficients in a data table and saw that they get huge.

To counter this, we change the function that is optimized by the regression. Regression fits a model $\hat{y} = \sum_i\beta_i x_i$ by optimizing a loss function such as mean square error, $\sum_j (y_i - \hat{y_i})^2$. To this, we add a punishment for large coefficients, so we optimize $\sum_j (y_i - \hat{y_i})^2 + \alpha\sum_i \beta_i^2$. The first term punishes bad fits, the second punishes large coefficient. The factor $\alpha$ balances between the two parts: larger $\alpha$ forces the model to fit the data less, that is, to keep all coefficients small.

Alternatively, we can take absolute values of coefficients instead of squaring them, $\sum_j (y_i - \hat{y_i})^2 + \alpha\sum_i |\beta_i|$. This will encourage the model to set coefficients to zero instead of keeping them small (see https://youtu.be/sO4ZirJh9ds?t=232 for an explanation why this happens). This kind of regularization is thus useful for selecting subsets of features.

The first type of regularization is called ridge regression or L2, and the second is LASSO or L1.

Similar regularization can be used in classification, in particular in logistic regression.

We should take this as a concrete example of a general problem. In our first lecture, where we said it is impossible to decide what is the right number of bins in distribution, or the appropriate amount of smoothing. In modelling, we balance between overfitting and overgeneralizing (or undergeneralizing and underfitting). If we have a rather smooth decision boundary and some data instances are on the wrong side, do we curve the boundary (less generalizing, more fitting) or do we treat the data on the wrong side as pure noise in the data (and thus generalize more and fit less)?

Practically every model has a parameter for setting the balance, like the above $\alpha$.

• Classification trees: setting the requires number of data instances in leaves or sufficient proportion of majority class decides when to stop splitting the data and thus "smoothing" the decision boundary. Smaller trees are more general and less (over?)fit.

• Logistic regression: we have L1 and L2, just as described above.

• SVM: there are a bunch of parameters that set the complexity of the boundary and the misclassification penalty.

• k Nearest neighbours: the larger the number of neighbours, the more the model moves towards the mean/majority.

• Naive Bayesian classifier is usually implemented without balancing, though one can essentially regularize probabilities $P(c|x_1)$. We are not going deeper here since you will have a whole course dedicated to it.

• Random forest: this method is strange because there is no clear consensus about how and when it can overfit. It is surely a bad idea to have a small forest with large trees, but with small trees it seems that you can't overfit by increasing the number of trees.

In summary, with power comes responsibility. The more complex the model, the more it can overfit and the more you have to be careful avoid overfitting it.

• ### Clustering

We use predictive modelling (supervised learning) when our goal is to predict the value of dependent variable (also called target, outcome, class, label) based on independent variables (features, attributes, variables, properties). In a different setup, we do not have any particular target but just some objects with known properties (essentially independent variables) and our goal is to find some general patterns. Such methods belong to the field of unsupervised modelling.

One such task is finding separate groups in the data, if those exist. We call such groups clusters, and the procedure is called clustering.

We learned about two approaches. In hierarchical clustering, every data point represents a cluster and we iteratively merge clusters until arriving at a single cluster. The process results in a hierarchy of clusters. The hierarchy is presented with a dendrogram, a tree-like visualization in which the lengths of branches correspond to distances between clusters. Dendrogram also offers an insight into the number of clusters in the data: we usually set a cut-off at a point where the distances begin to grow more rapidly. One of the decisions we have to make is the definition of distances between two clusters of points: we may consider the closest pair (single linkage), the farthest (complete linkage), the average distance (average linkage) or another measure based on intra-cluster variance (Ward linkage). We will usually use the latter, unless there are some principal arguments for others.

The other approach we considered is k-means clustering. We first set the number of clusters ($k$). The method puts $k$ centroids at random position and assigns each data instance to the closest centroid, thus forming some clusters. Next it moves the centroid to the center of this cluster. Then it reassigns the data instances according to the new positions of the centroids. Then it again moves the centroids...

The two methods - hierarchical clustering and k-means clustering - differ in several imporant aspects.

Hierarchical clustering starts from a matrix of distances that do not necessarily correspond to any observable coordinates; they may represent, say, subjectively determined dissimilarities between people. K-means works with coordinates and computes (Euclidean) distances between them.

Hierarchical clustering provides a nice visualization, which can be used to inspect the quality of clustering and determine the number of clusters. K-means does none of this.

Hierarchical clustering consumes a lot of memory and may be slow. K-means is often faster, although it may get stuck in suboptimal positions. In practice, we restart it multiple times and take the best result.

Hierarchical clustering results in a hierarchy, and we decide the number of clusters after the procedure. In k-means we set it in advance. In practice, we tried with different numbers of clusters and take the one with the highest silhouette score - another topic of the lecture.

The chapter below describes all of this in more detail.

• ### Text mining

Text mining is a field of machine learning with some specific for dealing with text. As texts are complex objects, we need to transform them into a numeric structure before the analysis.

Preprocessing. Preprocessing is the first and a very important step and we can never go without it. For example, say that we have a sentence:

"I like working with computers."

Now, we can transform it in the following way:

• transform to lowercase: "i like working with computers."
• split into analytical units (called tokens): "i", "like", "working", "with", "computers", "."
• normalize words (transform them into base words): "i", "like", "work", "with", "computer", "."
• filter stopwords, punctuation: "i", "like", "work", "computer"
• filter by document frequency: remove words that appear in more than X and less than Y % of documents\

Some tokenization procedures can already discard punctuation. Otherwise we have to do it manually. Filtering by document frequency can be relative (word should appear in a certain percent of documents) or absolute (word should appear in a certain number of documents).

Bag of Words. The second step is Bag of Words, which transforms text (and the prepared tokens) into document vectors. A simple way to do it is to count the words, but a more elegant approach is term frequency - inverse document requency (TF-IDF), which ecreases the count of words which appear frequently across all documents and increases the count for those that are significant for a small number of documents.

$$\mathrm{TF} = \frac{\mathrm{occurences\ of\ word\ in\ doc}}{\mathrm{words\ in\ doc}}, \;\; \mathrm{IDF} = \frac{\mathrm{number\ of\ docs}}{\mathrm{docs\ that\ contain\ word}}$$

TF-IDF measure is the product of the two, $$\mathrm{TFIDF} = \mathrm{TF}\times \mathrm{IDF}.$$

Afterward we can do clustering, classification or any other analysis we wish.

Word Enrichment. Word enrichment is a nice way to inspect a data subset. It computes those words that are significant for a subset compared to the entire corpus.

Sentiment Analysis. Sentiment analysis is a popular approach for analyzing user opinion, product reviews, and so on. The most simple methods are lexicon-based, which means there is a list in the background that defines positive and negative words, then counts the occurrences of each and sums them. The sum is the final sentiment score.

• ### Projections and embeddings

Hierarchical clustering shows us hierarchical grouping, yet it does not tell us that much about the structure of the data. Some points from one cluster may be close to points from another, yet they fall into a different part of the hierarchy.

Multidimensional scaling (MDS) is technique that puts data instance onto a two dimensional plane (or, in principle, into a higher number of dimensions), so that the distances between points correspond to the given distances between objects as closely as possible. The optimization works by starting from a random (or a not-so-random, but precomputed by another method) position, and the iteratively move the points towards positions that more closely match the prescribed distances. It may get stuck in local optima, so we can restart the procedure. The resulting visualization allows us to manually observe any structure in the data, like groups, and relations between points. MDS is a method of (multivariate) data projections, but it is non-parametric: the position of the point does not depend on values of some variables, but on its relation with other points.

Principal component analysis is a method that finds a linear projection of data into a lower dimensional space. Just as three-dimensional data may actually lie on a plane (or within a relative small distance above or below it), 100-dimensional data may actually lie on, say, 13-dimensional hyperplane (with small distances from it). You will learn more about PCA in another class, so we presented it just intuitively and showed its application in Orange.

PCA is unsupervised: it does not take the target value (if there is any) into account. In the sense of the scatter plot: it is color blind. It may be the case that some 3d data lies more or less on the plane, so PCA would select these two dimensions and discard the third as explaining too little variance. The problem may be that we actually have a classification problem and the class depends exactly on whether the point is (slightly) below or above the plane. This may be solved by the partial least squares.