Introduction to predictive modelling
Section outline
-
We considered a simple machine learning algorithm: induction of classification trees. Classification trees are models that can predict the value of the target variable (outcome, class) for a given data instance. Trees can be induced from data using a suitable algorithm. We used a top-down approach that divides the data into ever smaller subsets, at each step using the most informative variable as a splitting criterion.
Models must generalize, that is, they must be able to make predictions for data instances that were never seen before. In order to generalize, they must "smooth" over the data, ignoring potential errors or specific cases. Taking trees as an example: they must be neither too large (overfitting to every specific data instance) nor too small (too general).
The amount of smoothing is regulated by certain parameters of the fitting (learning) algorithm. In the case of trees, we set the minimal number of data instances in leaves, the maximal tree depth, the proportion of the majority class at which we stop dividing, etc.
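The top-down induction and the three stopping parameters mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, not the algorithm used at the lecture: it assumes categorical variables, assumes information gain (reduction of entropy) as the measure of how informative a variable is, and all names (`induce`, `predict`, `min_leaf`, `max_depth`, `max_majority`) and the toy data are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())


def split(rows, labels, var):
    """Group rows and labels by the value of variable `var`."""
    subsets = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = subsets.setdefault(row[var], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return subsets


def best_variable(rows, labels, variables):
    """Return the variable with the highest information gain (None if no gain)."""
    best, best_gain = None, 1e-12
    for var in variables:
        subsets = split(rows, labels, var)
        remainder = sum(len(l) / len(labels) * entropy(l)
                        for _, l in subsets.values())
        gain = entropy(labels) - remainder
        if gain > best_gain:
            best, best_gain = var, gain
    return best


def induce(rows, labels, variables, depth=0,
           min_leaf=1, max_depth=10, max_majority=1.0):
    """Top-down induction. A leaf is a class label (a string); an inner
    node is a tuple (variable, branches, majority class)."""
    majority, count = Counter(labels).most_common(1)[0]
    # Stop: tree is deep enough, or the majority class is dominant enough
    if depth >= max_depth or count / len(labels) >= max_majority:
        return majority
    var = best_variable(rows, labels, variables)
    if var is None:
        return majority
    subsets = split(rows, labels, var)
    # Stop: the split would produce a leaf with too few data instances
    if any(len(l) < min_leaf for _, l in subsets.values()):
        return majority
    rest = [v for v in variables if v != var]
    branches = {value: induce(r, l, rest, depth + 1,
                              min_leaf, max_depth, max_majority)
                for value, (r, l) in subsets.items()}
    return (var, branches, majority)


def predict(tree, row):
    """Follow the branches that match `row` until a leaf is reached."""
    while isinstance(tree, tuple):
        var, branches, majority = tree
        if row[var] not in branches:  # value not seen during induction
            return majority
        tree = branches[row[var]]
    return tree


# Made-up toy data: should we play outside?
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "stay", "stay"]
variables = ["outlook", "windy"]

full_tree = induce(rows, labels, variables)               # no smoothing
small_tree = induce(rows, labels, variables, min_leaf=2)  # more smoothing
query = {"outlook": "rainy", "windy": "no"}
print(predict(full_tree, query), predict(small_tree, query))
```

With the default parameters the tree fits every instance, including the single rainy, windless day on which we played. With `min_leaf=2` that instance is smoothed over: the tree stops at `outlook` and predicts the majority class for all rainy days.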
Pioneers of AI liked classification trees because they believed that trees mimic human reasoning. While trees may be interesting for historical reasons, they are of little practical importance today. They are useful, though, for illustrating some basic principles: they introduced us to ideas that we will keep encountering as we proceed to more complex models.
-
The page contains the crux of the lecture. Its title and the fact that the first link on the page points to a completely unrelated kind of decision trees demonstrate why "classification tree" is a better term than "decision tree". [Mandatory reading, with a grain of salt; you most certainly don't need to know that "It is also possible for a tree to be sampled using MCMC." :) ]
-
We have spent a lot of time explaining the concept of entropy. The Wikipedia page is rather mathematical and dry, but may be a good antidote to the less formal and less organized exposition at the lecture. :)
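As a quick reminder of what entropy measures, here is a small sketch of Shannon entropy for a discrete distribution. It assumes base-2 logarithms, so the result is in bits; the function name and the example distributions are chosen just for illustration.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    # Terms with p = 0 contribute nothing, so they are skipped
    return sum(-p * log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))    # fair coin: 1.0
print(entropy([0.25] * 4))    # four equally likely classes: 2.0
print(entropy([1.0]))         # a certain outcome: 0.0
print(entropy([0.9, 0.1]))    # a mostly pure subset: below 1 bit
```

The more uneven the class distribution, the lower the entropy, which is why a split that produces nearly pure subsets is considered informative.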
-
Quinlan is the author of one of the first and most influential algorithms for induction of classification trees. The article is mostly of historical interest, but it shows the thinking of the pioneers of AI. After some philosophy in the first two sections, it explains the reasoning behind the tree-induction algorithms.