Solution: Classification boundaries
First Part
Consider the following classifiers:
- classification tree of depth 2 (a so-called "stump")
- classification tree of depth 3
- logistic regression
- SVM with an RBF (radial basis function) kernel and g=1
- random forest with 100 trees
- nearest neighbors classifier with the number of neighbors set to 5
For each of these classifiers, paint:
- a data set where the classifier finds the "right" decision boundary
- a data set where the classifier fails to find the "right" decision boundary
Demonstrate both cases with screenshots of scatter plots.
Second Part
Demonstrate the effect of regularization strength for SVM with RBF kernels (modify the value of g) or Neural Networks (modify the number of layers and the number of neurons per layer).
Solution
We need a workflow (schema) like this one. We set the Scatter Plot to color the points according to the predicted class and to use shapes to represent the original class. Ideally, we see only red crosses and blue circles, that is, the predictions agree with the painted classes.
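The same setup can be reproduced in a script; below is a minimal Python sketch (scikit-learn, matplotlib, and the two toy blobs standing in for painted data are assumptions, not part of the original workflow) that colors points by the predicted class and shapes them by the true class.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two hand-made blobs stand in for "painted" data.
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([2, 2], 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

pred = LogisticRegression().fit(X, y).predict(X)

# Color = predicted class, marker shape = true class.
colors = np.where(pred == 0, "tab:blue", "tab:red")
for cls, marker in ((0, "o"), (1, "x")):
    mask = y == cls
    plt.scatter(X[mask, 0], X[mask, 1], c=colors[mask], marker=marker)
plt.title("Color = predicted class, shape = true class")
plt.show()
```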
The homework has many possible solutions. Here are a few that demonstrate some interesting cases:
Model | Works | Doesn't work
---|---|---
A classification tree of depth 2 cuts the space twice, with both cuts perpendicular to the axes. It cannot make diagonal cuts, not even a single one. | |
A classification tree of depth 3 cuts the space three times, with all cuts perpendicular to the axes. It can handle the C-shaped drawing here, which the depth-2 tree could not. On the other hand, no tree can (efficiently) handle separation along sloped lines. | |
Logistic regression draws a single line at an arbitrary slope without blinking. On the other hand, the line it draws is always straight, and it is always a single line. It made a good effort with the picture on the right, though: it found a line that almost separates the reds from the blues. | |
SVM with an RBF kernel, a random forest with 100 trees, and k-nearest neighbors draw complex boundaries that can separate anything. In the picture on the right, at the bottom left, there is something they do not do: the red circle is so close to the other group that they treat it as noise and place it in the other group. | |
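The observations in the table can also be checked in code. The sketch below (scikit-learn and matplotlib assumed, with make_moons as a made-up stand-in for hand-painted data) draws the decision regions of all six classifiers side by side.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # stand-in for painted data

classifiers = {
    "Tree, depth 2": DecisionTreeClassifier(max_depth=2),
    "Tree, depth 3": DecisionTreeClassifier(max_depth=3),
    "Logistic regression": LogisticRegression(),
    "SVM (RBF, gamma=1)": SVC(kernel="rbf", gamma=1),
    "Random forest (100 trees)": RandomForestClassifier(n_estimators=100),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (name, clf) in zip(axes.ravel(), classifiers.items()):
    clf.fit(X, y)
    zz = clf.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)          # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)     # data points, colored by true class
    ax.set_title(name)
plt.tight_layout()
plt.show()
```

The trees show axis-parallel cuts, logistic regression a single straight line, and the three flexible models (SVM, forest, kNN) curved regions.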
Whether the model will correctly classify the example in the last picture depends upon the strength of regularization (smoothing) it uses. Stronger regularization prevents the algorithm from making the model more complicated just to cover a few additional, weirdly located points — it treats such points as noise.
This is easiest to observe with the nearest neighbors classifier. If the number of neighbors considered when making the prediction is small, the classifier correctly classifies all points. If the number is increased, points lying close to regions covered by the other color are misclassified.
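A minimal sketch of this effect, assuming scikit-learn and a made-up data set with one lone red point placed inside the blue cloud:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.6, size=(60, 2)),   # blue cloud
               rng.normal([3, 3], 0.6, size=(60, 2)),   # red cloud
               [[0.3, 0.3]]])                           # lone red point inside the blue cloud
y = np.array([0] * 60 + [1] * 61)

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k:2d}: lone red point predicted as class", knn.predict([[0.3, 0.3]])[0])
# With k=1 the lone point is its own nearest neighbor and is classified as red (1);
# with k=15 the surrounding blue points outvote it and it is misclassified as blue (0).
```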
The behavior of a model thus depends not only on its type but also on its settings.
We shall demonstrate the effects of regularization with the following workflow:
In the Confusion Matrix widget, we click "Select Correct", so the correctly classified instances are shown as filled symbols in the Scatter Plot. Let us see what SVM does for this data:
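In a script, the "Select Correct" step can be mimicked with a boolean mask. The sketch below (scikit-learn, matplotlib, and make_moons data are assumptions, not the painted data from the workflow) shows correctly classified points as filled markers and misclassified ones as hollow markers.

```python
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.3, random_state=1)
pred = SVC(kernel="rbf", gamma=1).fit(X, y).predict(X)

correct = pred == y
# Filled markers: correctly classified; hollow markers: misclassified.
plt.scatter(X[correct, 0], X[correct, 1], c=y[correct], marker="o")
plt.scatter(X[~correct, 0], X[~correct, 1], facecolors="none",
            edgecolors="black", marker="o")
plt.title("Filled = correct, hollow = misclassified")
plt.show()
```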
SVM with RBF kernel is regulated by two parameters: misclassification cost (C) and kernel width (gamma).
Misclassification cost sets the balance between the curve complexity and the misclassification rate. With a higher cost, the fitting algorithm is more prone to creating a more complex boundary to decrease the number of misclassifications, potentially overfitting the data. With the default setting, SVM creates a boundary that perfectly splits the data we painted. If we decrease the cost to 0.1, we get the following.
The boundary is simplified by cutting off the three blue instances at the bottom and a group of red points surrounded by the blue. With such a low misclassification cost, curve simplicity takes priority over a few misclassifications.
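A rough way to see the cost effect without the canvas: assuming scikit-learn and a made-up moons data set (the painted data itself is not available here), fit the same RBF SVM with different values of C and compare how tightly each fits the training data.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for C in (0.1, 1.0, 100.0):
    svm = SVC(kernel="rbf", C=C, gamma=1).fit(X, y)
    # Higher C puts more weight on avoiding training errors; lower C allows
    # a simpler boundary with a few more misclassifications.
    print(f"C={C:6.1f}: training accuracy {svm.score(X, y):.2f}, "
          f"support vectors {len(svm.support_)}")
```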
Let us now set the cost to 0.5 and play with gamma. The gamma argument defines the inverse kernel width; the lower the value, the wider the influence of each data point. If we set gamma to 0.01 (an extremely low value), we essentially get a majority classifier for this data. The influence of every point is so wide that the blue points, which are in the majority, overwhelm the red and all red instances are misclassified as blue.
Increasing gamma to 0.1 gives the following.
The strong group of red points at the bottom extends its influence over the blue points at the far end of the blue region, while the red points enveloped by the blue are misclassified as blue. The most interesting part is at the top right, where the red points lying too close to the large mass of blue points are misclassified as blue, while those farther away from the blue are classified as red.
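The same kind of behavior can be checked in a script. The sketch below uses made-up data (a large, spread-out blue class with three small red clumps inside it; scikit-learn is an assumption), where a very low gamma should leave the model little better than a majority classifier, while higher gammas recover the red clumps.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
blue = rng.uniform(-3, 3, size=(200, 2))                   # majority class, spread out
red = np.vstack([rng.normal(c, 0.2, size=(10, 2))          # three small red clumps
                 for c in ([-2, -2], [0, 1.5], [2, -1])])  # placed inside the blue region
X = np.vstack([blue, red])
y = np.array([0] * 200 + [1] * 30)

for gamma in (0.01, 0.1, 1.0):
    pred = SVC(kernel="rbf", C=0.5, gamma=gamma).fit(X, y).predict(X)
    print(f"gamma={gamma:4.2f}: red points still classified as red:",
          int(np.sum((pred == 1) & (y == 1))), "/ 30")
```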
For another example, compare the following two images, in which gamma is 1 (left) and 0.1 (right); to get these images, the Scatter Plot has to be connected to Predictions and the points colored by the SVM's predictions.
The left image has a more appropriate value of gamma, which results in nicely shaped (not too simple, not too complex) regions. Decreasing gamma increases the kernel width, which expands both the blue and the red regions in the parts where the opposing class has a lower density: the blue region mostly swells, except at the bottom, where the red expands across the tip of the blue region.
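For completeness, a sketch of the same comparison in code (scikit-learn, matplotlib, and make_moons data are assumptions, not the painted data): the two panels differ only in gamma, 1 on the left and 0.1 on the right.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, gamma in zip(axes, (1.0, 0.1)):
    svm = SVC(kernel="rbf", C=0.5, gamma=gamma).fit(X, y)
    zz = svm.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)        # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)   # data points, colored by true class
    ax.set_title(f"gamma = {gamma}")
plt.tight_layout()
plt.show()
```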