Topic Name Module introduction
Web page Much Further Reading

Every section below contains a few papers (or even Wikipedia pages) that are easy to read, without much math. Some students asked more advanced questions (e.g. about the relation between Bayesian modeling and logistic regression), so here is a list of some more advanced books.

Exam File Data for the third part
Visualizations (and Getting to Know Orange) Web page Exercise (visualizations)
File Mushrooms
Web page Exercise (insignificance of significance)
URL Orange and basic visualizations
URL Mosaic and Sieve diagram
(If the video doesn't play: it happens to me, too. There seems to be something wrong on YT's side. I hope it resolves itself; otherwise I'll reupload.)
URL Task solutions
URL Arguments against testing of null hypotheses
URL Of Carrots, Horses and the Fear of Heights
A blog post about how statistical tests work and why we have to be very careful when using them in data mining.
URL How to Abuse p-values in Correlations
A shorter and drier version of the same.
File Cohen (1994): The Earth is Round (p < 0.05)
A famous, juicy paper with general arguments against null-hypothesis testing.
Introduction to predictive modelling Web page Surviving on mushrooms
Web page Recognizing types of animals
File Animals
Web page Exploring Human Development Index
File Human development index (+ religions + continents)
URL Classification trees
URL Decision tree learning (Wikipedia) [mandatory read, but see remark]

The page contains the crux of the lecture. Its title, and the fact that the first link on the page points to a completely unrelated kind of decision trees, demonstrate why "classification tree" is a better term than "decision tree". [Mandatory reading, with a grain of salt; you most certainly don't need to know that "It is also possible for a tree to be sampled using MCMC." :) ]

URL Information Gain in Decision Trees (Wikipedia) [optional reading]

We have spent a lot of time explaining the concept of entropy. The Wikipedia page is rather mathematical and dry, but it may be a good antidote to the less formal and less organized exposition at the lecture. :)
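
For readers who prefer code to formulas, here is a minimal sketch of entropy and information gain as they are commonly defined for classification trees. The function names and the tiny mushroom-like data set are made up for illustration; they are not taken from the lecture or from Orange.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: six mushrooms, two candidate features.
odor = ["foul", "foul", "none", "none", "none", "foul"]
cap = ["flat", "bell", "flat", "bell", "flat", "bell"]
edible = ["no", "no", "yes", "yes", "yes", "no"]

print(information_gain(odor, edible))  # odor separates the classes perfectly
print(information_gain(cap, edible))   # cap shape tells us very little
```

In this toy example a tree-induction algorithm that picks the feature with the highest gain would split on odor first.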

URL Induction of Decision Trees (Quinlan, 1986) [optional reading]

Quinlan is the author of one of the first and most influential algorithms for induction of classification trees. The article is more of historical interest, but it shows the thinking of the pioneers of AI. After some philosophy in the first two sections, it explains the reasoning behind the tree-induction algorithms.

Model Performance Web page Scores for evaluation of models
File mushroom-predictions
File Sara's Hamsters
File Sara's Hamsters - solution
Web page Cross validation
URL Scores for evaluation of model performance
URL List of performance scores (Wikipedia)
Lists all kinds of scores; useful as a reference.
URL Cross validation (Wikipedia) [optional]

Use this page as a list of different sampling techniques.

URL An introduction to ROC analysis (Fawcett, 2006) [mandatory: first seven sections]

A very accessible paper about ROC curves.

URL A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss [just the Introduction; optional]
Linear models for classification Web page Recognizing mushrooms - again
File Mushrooms (numeric)
Web page Decision boundaries
URL Linear models for classification
URL Logistic regression (Shalizi, 2012) [mandatory, see below which parts]

A more mathematical (compared to our lecture), but still friendly explanation of logistic regression. Read the first 6 pages, that is, section 12.1 and the (complete) section 12.2. You don't have to know the formulas, but you need to understand their meaning.

(This is Chapter 12 from Advanced Data Analysis from an Elementary Point of View. Download the draft from the author's site while it's free.)

File Nomograms for Visualization of Naive Bayesian Classifier (Možina, 2004) [mandatory, you may skip Section 2]

A quick derivation of the Naive Bayesian classifier, and derivation and explanation of nomograms.

Web page Nomograms for Linear Models
Other types of classifiers Web page Exploration of Kernel Methods
URL Other models
URL A nice explanation of the kernel trick
URL Kernel Methods for Pattern Analysis (Shawe-Taylor, Cristianini, 2004) [optional, beyond this course]

The best-known book about kernel methods like SVM. Warning: lots of mathematics. Not required reading for this class.

URL Random Forests (Breiman, 2001) [optional]

In contrast to SVM, random forests are so easy to explain and understand that they don't require additional literature. But if anybody is interested, here's the definitive paper about them.

File The Random Subspace Method for Constructing Decision Forests (Ho, 1998) [optional]

... and this is the paper from the less-known inventor of the method. It was Breiman (above paper), though, who thoroughly examined the method and gave it a name. Neither this paper nor the one above is required reading.

Regularization Web page Regularization Experiment
URL Regularization
URL Elements of Statistical Learning [optional, way beyond this course]

We are telling you about this book because we must do it at some point. It is too difficult for this course, but it provides an overview of machine learning methods from a statistical point of view. Chapters 4 and 5 should not be too difficult, and you can read them to better understand linear models and regularization.

You can download the book for free.

Clustering Web page Clustering versus Classification
Web page Exploration of linkage functions
File Data sets for clustering
Web page Exploration of Dendrograms
Web page Exploration of Clusters
URL Clustering (part 1: k-means and hierarchical clustering)
URL Clustering (part 2: linkages, distances)
URL Introduction to Data Mining, Chapter 8: Cluster Analysis: Basic Concepts and Algorithms (Tan P-N, Kumar, 2006)

Obligatory reading: sections 8.2 (you may skip 8.2.6) and 8.3 (skip 8.3.3), and the part on the Silhouette Coefficient (pg. 541). Everything else is also quite easy to read, so we recommend it.

Text mining Web page Fake news
URL Text Mining
URL Text Mining - In-class assignment
URL Allahyari et al. - A Brief Survey of Text Mining

Comprehensive overview of text mining techniques and algorithms. [obligatory]

URL Bird and Klein: Regular Expressions for Natural Language Processing
Why regular expressions can be very helpful. [optional read]
URL Ramos: Using TF-IDF to Determine Word Relevance in Document Queries

Why using TF-IDF is a good idea. [technical, interesting read]
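
To make the idea concrete, here is a minimal sketch of one common TF-IDF variant (term frequency times the logarithm of the inverse document frequency). The function name, the whitespace tokenization and the three example sentences are assumptions made for illustration; the paper and real text-mining tools may use different weighting and preprocessing.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
docs = [
    "the mushroom is edible",
    "the mushroom is poisonous",
    "the hamster is small",
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Words that occur in every document ("the", "is") get weight zero, while a word that appears in only one document ("edible") gets the highest weight in that document.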

URL An opinion word lexicon and a training dataset for Russian sentiment analysis of social media
File Text Mining course notes

Notes from the text mining lecture.

URL Liu, Bing: Sentiment Analysis and Opinion Mining

A book on sentiment analysis and opinion mining. Freely available at the link.

Projections and embeddings URL Projections
URL Deep learning and images
File Distances
URL Analysis of Multivariate Social Science Data, Chapter 3: Multidimensional Scaling (Bartholomew, 2008) [recommended]

The chapter is particularly interesting because of some nice examples at the end.

URL FreeViz—An intelligent multivariate visualization approach to explorative analysis of biomedical data (Demšar, 2007) [optional]

See the example in the introduction. You can also read the Methods section, if you're curious.

Embeddings ... and a practical case File animals and fruits
Assignments Web page Assignment: ROC Curve
Web page Assignment: Regression
Web page Assignment: Classifiers and their Decision Boundaries
Web page Solution: Classification boundaries
Web page Solution: ROC curve
Web page Solution: Regression