
# INTRODUCTION TO CLASSIFICATION

Work plan:

* Read in the data
* Split the data into **training** and **test** sets
* Build a model using the **training** set
* Evaluate the model using the **test** set


Please download the data file "players.txt" into a local directory.

We are going to use the scikit-learn library (imported as sklearn).

In [None]:
#%pip install scikit-learn

In [None]:
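import sklearn
import sklearn.metrics            # submodules are imported explicitly so that
import sklearn.model_selection    # sklearn.metrics etc. are available below
import sklearn.preprocessing
import pandas as pd
import numpy as np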

In [None]:
players = pd.read_csv("players.txt")
players

In [None]:
players.describe()

# ATTENTION

The "id" attribute uniquely identifies a player within our data set. 
This attribute cannot be used to classify new players as each player has a different id number. It is always important to check whether the data contains unnecessary attributes.

Before we continue, we'll remove the "id" attribute from our data set!

In [None]:
players = players.drop('id', axis=1)
players
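
The check for identifier-like attributes mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical toy table (the column values are made up, not taken from players.txt): a column whose number of distinct values equals the number of rows behaves like an identifier.

```python
import pandas as pd

# Hypothetical toy table; the real data comes from players.txt
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "height": [198, 203, 198, 211],
    "position": ["G", "F", "G", "C"],
})

# Columns whose number of distinct values equals the number of rows
# behave like identifiers and carry no generalizable information
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```

This is only a heuristic: a continuous measurement can also be unique in every row, so flagged columns still need a manual look.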

# Example 1 - Prediction of playing position


We want to build a model that predicts a player's playing position from the player's statistics. The target variable "position" is discrete, so this is a classification task. We want to find out whether historical data can be used to predict the playing positions of new players.


We are going to split the data into a training and testing data set.
The training data set consists of players that ended their careers in or before 1999.
The test data set consists of players that began their careers after 1999.


In [None]:
# Boolean masks such as players['lastseason'] <= 1999 select the rows for each set
train = players[players['lastseason'] <= 1999]
test = players[players['firstseason'] > 1999]
print("Number of training examples:", len(train))
print("Number of test examples:", len(test))
print("ratio of train examples:", len(train)/(len(train)+len(test)))
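
One consequence of this kind of split is easy to miss: a player whose career spans the cutoff year ends up in neither set. The sketch below uses a hypothetical toy table with only the two season columns to show this; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical careers; only the two columns used for the split are needed
toy = pd.DataFrame({
    "firstseason": [1985, 1995, 1998, 2000, 2003],
    "lastseason":  [1992, 1999, 2005, 2004, 2010],
})

train_mask = toy["lastseason"] <= 1999
test_mask = toy["firstseason"] > 1999

# Players whose careers span the cutoff satisfy neither condition
unused = toy[~train_mask & ~test_mask]
print(len(toy[train_mask]), len(toy[test_mask]), len(unused))
```

Because a career cannot end before it starts, the two masks never overlap, so no player can appear in both sets.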

We used the "firstseason" and "lastseason" attributes only to split the data. They should not take part in the modelling task, so we will remove them.

In [None]:
train = train.drop('lastseason', axis=1)
train = train.drop('firstseason', axis=1)

test = test.drop('lastseason', axis=1)
test  = test.drop('firstseason', axis=1)

Let's inspect the player positions in each set.

In [None]:
train['position'].value_counts()

In [None]:
test['position'].value_counts()

# Majority classifier

The majority class is the class with the highest number of training examples. The majority classifier predicts this class for every example. It is the simplest classifier and can be used as a baseline for comparison with other classifiers.

In [None]:
# idxmax() returns the label with the highest count
majority_class = train['position'].value_counts().idxmax()

print(majority_class)

How well does this perform on the test set?

We can evaluate model performance using classification accuracy (the percentage of correct predictions on the test set).

In [None]:
correct_predictions = test[test['position'] == majority_class]
CA = len(correct_predictions) / len(test)
CA
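
The same baseline can be written without pandas. The following is a minimal sketch on hypothetical label lists that stand in for the "position" column; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical labels, standing in for the "position" column
train_labels = ["G", "G", "F", "G", "C", "F"]
test_labels = ["G", "F", "G", "C"]

# The most common training label becomes the prediction for every test example
majority_label, _ = Counter(train_labels).most_common(1)[0]

# Accuracy is the share of test examples that carry that label
accuracy = sum(label == majority_label for label in test_labels) / len(test_labels)
print(majority_label, accuracy)
```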

# Decision Tree

The goal of decision trees is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Each node of a decision tree splits the dataset according to a decision rule.

## Some advantages of decision trees are:

* Simple to understand and to interpret. Trees can be visualized.
* Requires little data preparation. Other techniques often require data normalization, the creation of dummy variables and the removal of blank values. Some tree and algorithm combinations support missing values.
* The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
* Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now.
* Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.

## Some disadvantages of decision trees include:

* Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
* Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
* There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
* Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
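
The overfitting point from the list above can be illustrated with a small experiment. This is a sketch on synthetic data (an assumption made for illustration, unrelated to the players data): the labels depend on one feature and about 20% of them are flipped, so a tree can only reach a perfect training score by memorizing noise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one informative feature plus label noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2          # flip about 20% of the labels
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

scores = {}
for depth in (1, 3, None):            # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(depth, scores[depth])
```

The unrestricted tree fits the training half perfectly, while its test accuracy typically falls behind that of the shallow trees. Limiting max_depth is one of the mechanisms named above for keeping this gap small.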

In [None]:
from sklearn import tree

All sklearn classifiers can be used with the following steps:

* Create a classifier object
* Fit the classifier with .fit(X, Y) on the TRAIN set
* Obtain the predictions with .predict(X) on the TEST set
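
The three steps above can be sketched on a tiny example. The data here is hypothetical (a single made-up "height" feature), and the names X_toy_train, Y_toy_train and X_toy_test are chosen only for this illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature (height in cm) and a class label
X_toy_train = [[185], [188], [190], [208], [211], [213]]
Y_toy_train = ["G", "G", "G", "C", "C", "C"]
X_toy_test = [[187], [210]]

toy_clf = DecisionTreeClassifier(max_depth=1)   # 1. create the classifier object
toy_clf.fit(X_toy_train, Y_toy_train)           # 2. fit on the training set
toy_preds = toy_clf.predict(X_toy_test)         # 3. predict on the test set
print(list(toy_preds))
```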

In [None]:
X_train = train.drop('position', axis=1)    # Everything except the class
Y_train = train['position']                 # Just the class

X_test = test.drop('position', axis=1)
Y_test = test['position']

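Our dataset contains missing values, which can cause errors when training a classifier or making predictions. We remove the rows with missing values and rebuild the feature and target sets from the cleaned data.

In [None]:
train = train.dropna()
test = test.dropna()
X_train, Y_train = train.drop('position', axis=1), train['position']
X_test, Y_test = test.drop('position', axis=1), test['position']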

In [None]:
# max_depth controls the maximum depth of the decision tree. Smaller trees are less complex and can help prevent overfitting
clf = tree.DecisionTreeClassifier(max_depth = 3)   
clf.fit(train.drop('position', axis=1), train['position'])

Visualize the tree

In [None]:
column_names=list(train.drop('position', axis=1).columns)

In [None]:
print(tree.export_text(clf, feature_names = column_names))

After training, obtain predictions with .predict()

In [None]:
preds = clf.predict(X_test)
preds

How good is the model?

Let's calculate the classification accuracy.

In [None]:
correct_predictions = (preds == Y_test)
correct_predictions

In [None]:
CA = sum(correct_predictions) / len(preds)
CA

We can examine the results further using a confusion matrix

In [None]:
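# true value (rows) / predicted value (columns)
# sklearn orders the labels alphabetically: C, F, G
#    C   F   G
# C  x   x   x
# F  x   x   x
# G  x   x   x

In [None]:
# confusion_matrix expects the true labels first, then the predictions
conf_mat = sklearn.metrics.confusion_matrix(Y_test, preds)
print(conf_mat)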

We can also obtain the CA by summing the diagonal of the confusion matrix (the correct predictions) and dividing it by the sum of the whole matrix.

In [None]:
np.sum(np.diag(conf_mat)) / np.sum(conf_mat)
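
To make the relation between the matrix and the accuracy concrete, here is a minimal sketch that builds a confusion matrix by hand. The label lists are hypothetical, and the layout (rows are true classes, columns are predicted classes) is the convention assumed here.

```python
import numpy as np

# Hypothetical labels for illustration
y_true = ["C", "F", "G", "G", "F", "G"]
y_pred = ["C", "G", "G", "G", "F", "F"]
labels = ["C", "F", "G"]

# Rows index the true class, columns the predicted class
index = {label: i for i, label in enumerate(labels)}
matrix = np.zeros((3, 3), dtype=int)
for t, p in zip(y_true, y_pred):
    matrix[index[t], index[p]] += 1

# Correct predictions sit on the diagonal
accuracy = np.trace(matrix) / matrix.sum()
print(matrix)
print(accuracy)
```

Four of the six examples land on the diagonal, so the accuracy is 4/6.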

Additionally, we can obtain class probabilities instead of discrete predictions.

These can be used to calculate alternative scores, such as the Brier score, which takes the predicted class probabilities into account. In the multi-class form used here (squared differences summed over all classes and averaged over the examples), it returns a value between 0 (best) and 2 (worst).

In [None]:
preds_proba = clf.predict_proba(X_test)
preds_proba

In [None]:
# Encode ground truth classes to bits
# Sklearn encoders can be used to change data in various ways
encoder = sklearn.preprocessing.OneHotEncoder()
# Encoders require a 2D array so reshape is needed
Y_test_proba = encoder.fit_transform(np.array(Y_test).reshape(-1, 1))
Y_test_proba.todense()

In [None]:
def brier_score(preds, ground_truth):
    return np.sum((ground_truth - preds) ** 2) / len(preds)

In [None]:
brier_score(preds_proba, np.array(Y_test_proba.todense()))
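
A short worked example may help with reading the score. The sketch below repeats the brier_score definition so that it runs on its own and applies it to three hypothetical prediction matrices for two examples and three classes.

```python
import numpy as np

def brier_score(preds, ground_truth):
    # Sum of squared differences over all classes, averaged over the examples
    return np.sum((ground_truth - preds) ** 2) / len(preds)

# One-hot ground truth for two examples and three classes (hypothetical values)
truth = np.array([[1, 0, 0],
                  [0, 1, 0]])

perfect = truth.astype(float)                 # full probability on the correct class
uniform = np.full((2, 3), 1 / 3)              # no information
confidently_wrong = np.array([[0., 1., 0.],
                              [0., 0., 1.]])  # full probability on a wrong class

print(brier_score(perfect, truth))
print(brier_score(uniform, truth))
print(brier_score(confidently_wrong, truth))
```

The perfect prediction scores 0, the uniform one 2/3, and the confidently wrong one 2, which is the worst value this formulation can produce.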

# Example 2

Does a player make more than 80% of free-throws attempted?

This is a binary problem (the target variable is discrete with values YES and NO). We do not have this attribute, so we will need to calculate it.

In [None]:
# Select only players who have attempted at least 1 free throw
bin_players = players[players['fta'] > 0].copy()

# Calculate free-throw success rate
free_throw_rate = bin_players['ftm'] / bin_players['fta']

# Create a discrete attribute "ftexpert". This will be our target variable.
ftexpert = pd.cut(free_throw_rate, [-1, 0.8, 1], labels=["NO", "YES"])

# Assign the new attribute to a new column
bin_players['ftexpert'] = ftexpert

bin_players = bin_players.dropna()
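
The bin edges passed to pd.cut decide how boundary values are labelled. The following sketch uses hypothetical success rates to show that, with the default right-closed bins, a rate of exactly 0.8 is labelled "NO", which matches the "more than 80%" wording of the task.

```python
import pandas as pd

# Hypothetical success rates, including values right at the boundary
rates = pd.Series([0.0, 0.5, 0.8, 0.81, 1.0])

# Bins are (-1, 0.8] and (0.8, 1]: the right edge belongs to the bin
labels = pd.cut(rates, [-1, 0.8, 1], labels=["NO", "YES"])
print(list(labels))
```

The lowest edge is -1 rather than 0 because the left edge of a bin is excluded by default; starting at 0 would leave a rate of exactly 0 without a label.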

**Important:** The new ftexpert attribute is based on two existing attributes (fta, ftm). Therefore, it would be very easy to predict if we left these two attributes in the dataset. These two attributes must be removed.

**Important:** DecisionTreeClassifier needs all attributes to be numbers. Here, the "position" attribute is a string (either "G", "C", or "F") so it must first be converted to a number. This can be done using pd.get_dummies()

In [None]:
bin_players = bin_players.drop('fta', axis=1)
bin_players = bin_players.drop('ftm', axis=1)

In [None]:
#bin_players['position'] = bin_players['position'].astype('category')
bin_players = pd.get_dummies(bin_players, columns = ["position"])


In [None]:
bin_players.columns

In [None]:
bin_players[['position_C','position_G','position_F']]

Split the data into training and testing sets. This time we use the built-in sklearn function train_test_split.

In [None]:
X = bin_players.drop('ftexpert', axis=1)
Y = bin_players['ftexpert']
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, train_size = 0.8)
print(len(X_train), len(X_test), len(X_train)/(len(X_train) + len(X_test))) 

Inspect the distribution of the class variable

In [None]:
print(Y_train.value_counts())
print(Y_test.value_counts())

Train a decision tree

In [None]:
bin_clf = tree.DecisionTreeClassifier(max_depth=5)
bin_clf.fit(X_train, Y_train)
preds = bin_clf.predict(X_test)
preds

Calculate accuracy using sklearn

In [None]:
sklearn.metrics.accuracy_score(Y_test, preds)

Compare to majority (using sklearn DummyClassifier)

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
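# strategy="most_frequent" always predicts the majority class of the training set
majority_classifier = DummyClassifier(strategy="most_frequent")
majority_classifier.fit(X_train, Y_train)
majority_preds = majority_classifier.predict(X_test)
sklearn.metrics.accuracy_score(Y_test, majority_preds)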

# Sensitivity, Specificity and ROC curve

These three tools can be used to further examine classification results.

* Sensitivity: Correct positive predictions (TP) out of all positive examples (P): TP/P
* Specificity: Correct negative predictions (TN) out of all negative examples (N): TN/N


In [None]:
print("Sensitivity", sum((preds == 'YES') & (Y_test == "YES")) / sum((Y_test == "YES")))
print("Specificity", sum((preds == 'NO') & (Y_test == "NO")) / sum((Y_test == "NO"))) 
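
The two definitions can also be written out in terms of the four outcome counts. This is a minimal sketch on hypothetical label lists, with "YES" treated as the positive class.

```python
# Hypothetical labels for illustration; "YES" is the positive class
y_true = ["YES", "YES", "YES", "NO", "NO", "NO", "NO", "NO"]
y_pred = ["YES", "NO", "YES", "NO", "NO", "YES", "NO", "NO"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "YES" and p == "YES" for t, p in pairs)   # true positives
fn = sum(t == "YES" and p == "NO" for t, p in pairs)    # false negatives
tn = sum(t == "NO" and p == "NO" for t, p in pairs)     # true negatives
fp = sum(t == "NO" and p == "YES" for t, p in pairs)    # false positives

sensitivity = tp / (tp + fn)   # TP / P
specificity = tn / (tn + fp)   # TN / N
print(sensitivity, specificity)
```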

We can also obtain a set of scores using sklearn.metrics.classification_report. Here, sensitivity and specificity are present under different names. The relevant row is "YES", which shows the results if YES is treated as the positive class. In this function, sensitivity is referred to as recall, while specificity is the recall if we treat "NO" as the positive class (in the "NO" row).

We also see some additional scores:

* Precision, which is the number of correct positive predictions out of all positive predictions: TP/(TP + FP)
* F1-score, which is the harmonic mean of precision and recall ((2 * precision * recall)/(precision + recall))
* Support, which is the number of occurrences of each class in the true labels

In [None]:
print(sklearn.metrics.classification_report(Y_test, preds))
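
The additional scores in the report can be reproduced by hand from the outcome counts. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical outcome counts for the positive class "YES"
tp, fp, fn = 6, 2, 4

precision = tp / (tp + fp)                            # correct among predicted positives
recall = tp / (tp + fn)                               # correct among actual positives
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
support = tp + fn                                     # actual positives in the test set

print(precision, recall, f1, support)
```

With these counts the precision is 0.75 and the recall 0.6; the F1-score of about 0.667 lies between them and closer to the lower value, as expected for a harmonic mean.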

Since sensitivity and specificity are based on the number of positive and negative predictions, we can vary both by changing the prediction threshold.

First, predict probabilities instead of labels:

In [None]:
preds_proba = bin_clf.predict_proba(X_test)
preds_proba

With the basic .predict function, the class with the highest predicted probability is chosen. In a binary problem this amounts to a threshold of 0.5: if the probability of a class is >= 0.5, that class is predicted.

In [None]:
no_threshold = 0.5
# x[0] is the first value in each row of preds_proba, corresponding to the no class
threshold_predictions = ['NO' if x[0] >= no_threshold else 'YES' for x in preds_proba]   
all(threshold_predictions == preds)    # Our new predictions are the same as predictions obtained with .predict

If we want more positive predictions, we can increase the threshold for the "NO" class. This will lead to fewer negative and more positive predictions.

In [None]:
from collections import Counter

In [None]:
no_threshold = 0.8   # Now only examples where the probability for the no class is >= 0.8 will be labelled as "NO"
threshold_predictions = np.array(['NO' if x[0] >= no_threshold else 'YES' for x in preds_proba])
print("Threshold predictions", Counter(threshold_predictions))
print("Predictions from .predict()", Counter(preds))

More positive predictions mean that we (likely) increase the number of true positives (TP), while the number of all positives (P) remains the same. Therefore, the sensitivity (TP/P) increases. The reverse is true for specificity: fewer negative predictions mean fewer true negatives (TN), so TN/N decreases.

In [None]:
print("Sensitivity", sum((threshold_predictions == 'YES') & (Y_test == "YES")) / sum((Y_test == "YES")))
print("Specificity", sum((threshold_predictions == 'NO') & (Y_test == "NO")) / sum((Y_test == "NO"))) 
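
The effect of the threshold can be seen by sweeping over several values. The sketch below uses hypothetical probabilities for the "NO" class and hypothetical true labels; the thresholds are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities of the "NO" class and the true labels
p_no = np.array([0.9, 0.85, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
y_true = np.array(["NO", "NO", "YES", "NO", "YES", "NO", "YES", "YES"])

results = {}
for no_threshold in (0.5, 0.65, 0.8):
    # Label an example "NO" only if its "NO" probability reaches the threshold
    y_pred = np.where(p_no >= no_threshold, "NO", "YES")
    sensitivity = np.mean(y_pred[y_true == "YES"] == "YES")
    specificity = np.mean(y_pred[y_true == "NO"] == "NO")
    results[no_threshold] = (sensitivity, specificity)
    print(no_threshold, sensitivity, specificity)
```

As the threshold for "NO" rises from 0.5 to 0.8, the sensitivity in this toy data grows from 0.5 to 1.0 while the specificity drops from 0.75 to 0.5.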

We can visualize all possible trade-offs between sensitivity and specificity using a ROC curve.

In [None]:
from sklearn.metrics import RocCurveDisplay

In [None]:
# ROC curve in text form (less common)
fpr, tpr, thresholds = sklearn.metrics.roc_curve(Y_test, [x[1] for x in preds_proba], pos_label="YES")
print(fpr)
print(tpr)
print(thresholds)

# ROC curve in graph form
RocCurveDisplay.from_estimator(bin_clf, X_test, Y_test)

In [None]:
# Or from results
# The function takes only the probabilities of the positive class, hence [x[1] for x in preds_proba]
RocCurveDisplay.from_predictions(Y_test, [x[1] for x in preds_proba], pos_label="YES")

In [None]:
# Just the AUC score
sklearn.metrics.roc_auc_score(Y_test, [x[1] for x in preds_proba])

The larger the area under the curve (AUC), the better the results. The top-left corner of the plot corresponds to a classifier with both sensitivity and specificity equal to 1, the perfect result.
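
The AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following sketch computes it that way on hypothetical scores, without using sklearn.

```python
# Hypothetical scores for the positive class and the true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = ["YES", "YES", "NO", "YES", "NO", "NO"]

pos = [s for s, l in zip(scores, labels) if l == "YES"]
neg = [s for s, l in zip(scores, labels) if l == "NO"]

# Count positive/negative pairs in which the positive example scores higher;
# ties count as half a win
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)
```

Eight of the nine positive/negative pairs are ordered correctly here, so the AUC is 8/9. A value of 0.5 would correspond to random ordering.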