Section outline

  • We considered a simple machine learning algorithm: induction of classification trees. Classification trees are models that can predict the value of the target variable (outcome, class) for a given data instance. Trees can be induced from data using a suitable algorithm. We used a top-down approach that divides the data into ever smaller subsets, at each step using the most informative variable as a splitting criterion.

    Models must generalize, that is, they must be able to make predictions for data instances that were never seen before. In order to generalize, they must "smooth" over the data, ignoring potential errors or specific cases. If we take trees as an example: they must be neither too large (overfitting to every specific data instance) nor too small (too general).

    The amount of smoothing is regulated using certain parameters of the fitting (learning) algorithm. In the case of trees, we can set the minimal number of data instances in leaves, the maximal tree depth, the proportion of the majority class at which we stop dividing, and so on.

    Pioneers of AI liked classification trees because they believed that trees mimic human reasoning. While trees may be interesting for historical reasons, they are of little practical importance today. They are useful, though, for illustrating basic principles: they introduced us to ideas that we will keep encountering as we proceed to more complex models.
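
The top-down induction summarized above, together with its smoothing parameters, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the course: it takes "most informative" to mean highest information gain (the outline does not name a measure), handles discrete variables only, and the names `induce_tree`, `predict`, `min_leaf`, `max_depth`, `max_majority`, as well as the toy data, are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def split(rows, labels, var):
    """Group the class labels and rows by the value of the discrete variable `var`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[var], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def information_gain(rows, labels, var):
    """Entropy reduction obtained by splitting the data on `var`."""
    groups = split(rows, labels, var)
    remainder = sum(len(ls) / len(labels) * entropy(ls) for _, ls in groups.values())
    return entropy(labels) - remainder


def induce_tree(rows, labels, variables, depth=0,
                min_leaf=1, max_depth=5, max_majority=1.0):
    """Top-down induction; a leaf is a class label, an inner node is a dict."""
    majority, count = Counter(labels).most_common(1)[0]
    # Smoothing: stop at the depth limit or when the majority class dominates.
    if not variables or depth >= max_depth or count / len(labels) >= max_majority:
        return majority
    # Assumption: the "most informative" variable is the one with the highest gain.
    best = max(variables, key=lambda v: information_gain(rows, labels, v))
    groups = split(rows, labels, best)
    # Smoothing: refuse splits that gain nothing or create leaves that are too small.
    if (information_gain(rows, labels, best) <= 0
            or any(len(ls) < min_leaf for _, ls in groups.values())):
        return majority
    rest = [v for v in variables if v != best]
    branches = {
        value: induce_tree(sub_rows, sub_labels, rest, depth + 1,
                           min_leaf, max_depth, max_majority)
        for value, (sub_rows, sub_labels) in groups.items()
    }
    return {"var": best, "branches": branches, "default": majority}


def predict(tree, row):
    """Follow the branches matching `row` until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["var"]], tree["default"])
    return tree


# Hypothetical toy data: four instances, two discrete variables.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]

tree = induce_tree(rows, labels, ["outlook", "windy"])
smoothed = induce_tree(rows, labels, ["outlook", "windy"], min_leaf=2)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))
print(predict(smoothed, {"outlook": "rainy", "windy": "yes"}))
```

With the default parameters the tree fits every training instance; raising `min_leaf` to 2 blocks the final split, so the smoothed tree falls back to the majority class. On real data the same trade-off decides between overfitting and being too general.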