Model Performance
Section outline
We saw a model that was able to predict randomly assigned data labels. Since this is clearly impossible, something must have been wrong in our procedure.
We discovered that the tree essentially memorized the data: it split the space of data instances (you can imagine this space as a scatter plot) so that each part (a leaf of the tree) contained just a few data instances, all belonging to the same class. When making predictions, the tree just checked the region to which the data instance belonged and recalled the correct class.
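To see this effect for yourself, here is a minimal sketch; NumPy and scikit-learn are assumptions for illustration, not necessarily the tools used in the lecture. A fully grown tree fit to randomly labelled data looks perfect on the very data it was trained on.

```python
# A minimal sketch of the "impossible" result: a fully grown tree memorizes
# randomly assigned labels. NumPy and scikit-learn are assumed here for
# illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # random features
y = rng.integers(0, 2, size=100)     # randomly assigned class labels

tree = DecisionTreeClassifier()      # unrestricted depth: leaves can be made pure
tree.fit(X, y)

# Evaluated on the data it was trained on, the tree looks perfect:
# it has simply memorized which region of the space holds which label.
print("Training accuracy:", tree.score(X, y))   # prints 1.0
```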
To avoid such situations, we should never, ever estimate the quality of a tree by measuring its performance on the data that was used for learning it (the training data); we should always retain separate data for testing. We discussed different sampling techniques, the best known of which is cross-validation.
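A hedged sketch of the remedy on the same random data, again assuming scikit-learn: with cross-validation, every instance is held out for testing exactly once, and the apparent perfection disappears.

```python
# Honest evaluation of the tree on randomly labelled data: no score is ever
# computed on data the tree was trained on. scikit-learn is an assumption.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # random features
y = rng.integers(0, 2, size=100)     # randomly assigned class labels

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print("5-fold cross-validation accuracy: %.2f" % scores.mean())  # around 0.5, i.e. chance
```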
Next we focused on measuring the quality of models. We can use models for (at least) two purposes: to make predictions, or to gain some insight into the data.
For the latter, a good model is one that gives us some useful insight. We should treat models as hypotheses, as ideas, and test them against our background knowledge (does it make sense? does the tree use attributes that are related to the problem? ...).
For the former, we have different performance scores. Classification accuracy is the simplest, but it ignores the types of mistakes; classifying a sick person as healthy is (usually, at least for most observers :) worse than vice versa. Hence we have derived a number of scores that measure the probabilities of particular kinds of correct and incorrect decisions, such as the probability that a sick person will be classified as sick (sensitivity or recall), and the probability that a person classified as sick is actually sick (precision). There are dozens of such combinations. It is less important to know their names than to be aware that the score we use should correspond to our goals.
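A small worked example may help keep these scores apart; the confusion-matrix counts below are invented for illustration.

```python
# Invented confusion-matrix counts for the class "sick".
TP, FN = 40, 10     # sick people classified as sick / as healthy
FP, TN = 20, 130    # healthy people classified as sick / as healthy

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # ignores which kind of mistake was made
sensitivity = TP / (TP + FN)                   # recall: P(classified sick | actually sick)
precision   = TP / (TP + FP)                   # P(actually sick | classified sick)
specificity = TN / (TN + FP)                   # P(classified healthy | actually healthy)

print(accuracy, sensitivity, precision, specificity)   # 0.85 0.8 0.666... 0.866...
```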
Many of the above measures appear in pairs: by changing the decision threshold, we can improve one at the cost of the other. To visualize this, we use curves, such as the ROC curve.
Everybody should draw a ROC curve manually at least once in their life. We've done so in the lecture. This led us to think about other interesting properties of the curve, and especially about the interpretation of the area under it. All that we've done and said can be found in the paper we cite below, An introduction to ROC analysis. To test our understanding, we solved a problem faced by Sara, a hamster veterinarian.
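For those who want to repeat the manual exercise in code, here is a sketch with invented scores and labels: sort the instances by the model's score, sweep the threshold past one instance at a time, and record a (FPR, TPR) point at each step; the trapezoidal area under those points is the AUC.

```python
# Drawing a ROC curve "by hand" from invented model scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]   # 1 = positive (e.g. sick)

P = sum(labels)          # number of positives
N = len(labels) - P      # number of negatives

tpr, fpr, tp, fp = [0.0], [0.0], 0, 0
for score, label in sorted(zip(scores, labels), reverse=True):
    if label == 1:
        tp += 1          # lowering the threshold past a positive moves the curve up
    else:
        fp += 1          # ... past a negative moves it to the right
    tpr.append(tp / P)
    fpr.append(fp / N)

# Trapezoidal area under the curve: equals the probability that a randomly
# chosen positive instance is scored higher than a randomly chosen negative one.
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2 for i in range(len(fpr) - 1))
print(list(zip(fpr, tpr)))   # the points of the ROC curve
print("AUC:", auc)           # 0.8 for these invented numbers
```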
The most important message here was that when making predictions we have to set a decision threshold. The threshold usually balances two opposing quantities (sensitivity vs. specificity, precision vs. recall) and can also take costs into consideration. In practice we would often construct other types of curves to show the possible operating points.
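As a small illustration of choosing an operating point, here is a sketch with invented costs and prevalence: among a few (TPR, FPR) points read off an imaginary ROC curve, pick the one with the lowest expected cost per patient.

```python
# Choosing an operating point on a ROC curve; all numbers are invented.
COST_FN = 10        # assumed cost of classifying a sick patient as healthy
COST_FP = 1         # assumed cost of classifying a healthy patient as sick
PREVALENCE = 0.1    # assumed share of sick patients in the population

def expected_cost(tpr, fpr):
    """Expected per-patient cost at the operating point (tpr, fpr)."""
    return PREVALENCE * (1 - tpr) * COST_FN + (1 - PREVALENCE) * fpr * COST_FP

# A few candidate operating points (TPR, FPR) read off an imaginary ROC curve.
points = [(0.60, 0.05), (0.80, 0.20), (0.95, 0.40)]
best = min(points, key=lambda p: expected_cost(*p))
print("Cheapest operating point (TPR, FPR):", best)   # (0.80, 0.20) here
```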
- Use this page as a list of different sampling techniques.
- A very accessible paper about ROC curves: An introduction to ROC analysis.