## Topic outline

### Exploratory Analysis and Clustering

Attribute-based data sets. Preparing and loading the data into Orange Data Mining software. Data analysis workflows. Scatterplot and box plot. Hierarchical clustering: distances between data items, distances between clusters, agglomerative approach to data clustering. Cluster explanation.
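The agglomerative approach listed above can be sketched in a few lines of plain Python. This is only an illustration, not the way Orange implements it: the function names `average_linkage` and `agglomerative`, the choice of Euclidean distance between data items, and the choice of average linkage between clusters are all assumptions made for this example.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def average_linkage(a, b, points):
    """Distance between clusters a and b: mean distance over all cross-cluster pairs."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerative(points, n_clusters):
    """Start with one cluster per data item and repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the indices (x < y) of the two closest clusters.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        # Merge cluster y into cluster x and drop y.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerative(points, 3)
```

Swapping `average_linkage` for the minimum or maximum pairwise distance would give single or complete linkage, which is one way to see how the choice of cluster distance changes the result.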

Video lectures: Orange workflows, data exploration, workflow management, your own data, clustering-theory, clustering in 2d, clustering of multi-dimensional data, and clustering of zoo data set.

### Regression Models and Regularization

Linear regression. The shape of the model. Optimization function. Polynomial expansion. Overfitting. Regularization. Accuracy on training and test set. Evaluating the accuracy of regression models. Feature scoring and selection.

Video lectures: introduction to regression, linear regression, overfitting, regularization, training and test sets, L1 and L2 regularization, and model scoring with RMSE and R2.
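As a small illustration of the two scores named above, RMSE and R2 can be computed directly from their usual definitions. This is a minimal sketch in plain Python, not Orange's own code; the function names and the toy values are assumptions made for the example.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical test-set values and model predictions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

print(rmse(actual, predicted))
print(r2(actual, predicted))
```

RMSE is expressed in the units of the target variable, while R2 compares the model against always predicting the mean, so the two are usually reported together and computed on the test set rather than the training set.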

### Classification Models

Prediction models and how they differ from clusterings. Classification trees as an example of an intuitive, early prediction model. The naive Bayesian model as an efficient yet limited model. Linear models, e.g. logistic regression.

- (In the lecture, we used the term "classification tree" to avoid confusion with other, unrelated trees of the same name.)
- Quinlan is the author of one of the first and most influential algorithms for the induction of classification trees. The article is mostly of historical interest, but it shows the thinking of the pioneers of AI. After some philosophy in the first two sections, it explains the reasoning behind the tree-induction algorithms.
A more mathematical (compared to our lecture) but still friendly explanation of logistic regression. I recommend reading the first 6 pages, that is, section 12.1 and the (complete) section 12.2.

(This is Chapter 12 from Advanced Data Analysis from an Elementary Point of View. You can download the draft from the author's site.)

- A quick derivation of the Naive Bayesian classifier, and derivation and explanation of nomograms.