Section outline

  • The purpose of visualization could be to:

    • provide nice illustrations for newspaper articles;
    • help us discover patterns in the data.

    In data mining, we use them for the latter (except when trying to impress our bosses). We have not evolved as number crunching animals, so our brains are specialized for visual patterns. Given a sequence of numbers, we are not able to find patterns, like trends and correlations. Seeing a picture, we are. So the main purpose of insight-giving visualizations are to present the data in such a form that our brains can spot things that we would not be able to spot if the data was presented as just numbers (or whatever human-unfriendly form the data has).

    Here is a brief summary of what we covered.

    • Box Plot: shows basic statistics, good for observing a single variable. We can split it into multiple box plots by a categorical variable. In this case, we can compute chi squares or ANOVAs and order the variables by how well they variable splits the categories.

    • Distribution: shows distributions as histograms. We can again split them by a categorical variable. We emphasized the problem of choosing the number of bins, which can strongly affect the visualization. Curves also have to be considered carefully since they are either result of some smoothing, which has some parameters that are hidden or difficult to set and interpret, or they result of fitting of parameters to a shape that we have to decide -- and often have no good argument for a particular shape. Bins are essentially better because they do not hide anything -- unless the criteria for deciding the number of bins or boundaries are manipulated.

    • Scatter Plot: plots the relationship between two numeric variables. To show other quantities in the same plot, we can use colors, sizes, shapes or other fancy properties, such as angles. Out of these three, colors are the only one that really works. It is easy to see that a certain region has a prevalence of certain colors. (Sometimes this is too easy to see. We are very able to spot patterns where there are none. We'll see some examples in due time.)

      A note about using colors for showing numeric quantities. It is a very good idea to use discrete scale with not too many colors. An average human can distinguish roughly five different colors, except for women who can distinguish about a dozen thousands. If the scale of colors is continuous, it is impossible to compare a color in a legend with a color of the spot on the graph. While continuous scales appear to be more exact than discrete scales with, say, eight different colors, the latter are probably better.

      You have learned about other traps (such as using sizes) in the other course.

    • Mosaic Display: splits the data by 1-4 variables into ever smaller subgroups, such that the size of the area corresponds to the size of the group. The area can be split further to show the target variables distribution inside the group. This visualization is useful for finding relations between categorical features in a similar fashion as the scatter plot does for numeric.

    • Sieve Diagram: essentially a visualization of chi-square, it plots actual frequencies of two variables against their expected frequencies. Unlike in the Mosaic display, the area is split independently along each axis, so sizes correspond to the expected number of instances assuming that the variables are independent. Sieve then shows - using a grid and colors - the violations of independence, that is, combinations of values that are more common or rare than expected.