
Brownlow Modelling - KNN Classifier

What is KNN

I decided to start this blog post series off with the KNN Classifier because it is easy to understand conceptually. KNN stands for K-Nearest Neighbours. In essence, it looks at a data point, then looks at the K closest other data points (where K is a number we choose) to determine how to classify it.

Imagine we have 1,000 data points of players, their match stats and how many Brownlow votes each player got for those stats. We know things like Disposals, Kicks, Clearances, Goals etc. Now we are given match stats for a player and asked to predict how many votes that player will get. The KNN model finds the K data points we’ve already seen that are closest to the new one and predicts the vote count that is most common among them. With scikit-learn's default settings (used below), each of the K neighbours counts equally; setting weights='distance' instead gives closer neighbours more say.

Example: Player A got 35 disposals, 3 goals and 10 clearances. We’d look for the historical players whose stats are closest to 35 disposals, 3 goals and 10 clearances and see how many votes they got. A player who had 34 disposals, 3 goals and 10 clearances is a closer match than one who had 40 disposals, 3 goals and 15 clearances, so the first is more likely to be among the K nearest neighbours and to influence our classification.
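To make the "closest players" idea concrete, here is a minimal from-scratch sketch using only the standard library. It is not how scikit-learn implements KNN internally, and the history rows and their vote counts are made up purely for illustration; `euclidean` and `knn_predict` are hypothetical helper names.

```python
import math
from collections import Counter

# Made-up history: (disposals, goals, clearances) -> votes. Not real Brownlow data.
history = [
    ((34, 3, 10), 3),
    ((40, 3, 15), 3),
    ((33, 2, 9), 2),
    ((18, 0, 2), 0),
    ((15, 1, 1), 0),
    ((20, 0, 3), 0),
]


def euclidean(a, b):
    """Straight-line distance between two stat lines."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(stats, history, k):
    """Return the most common vote count among the k closest historical rows."""
    nearest = sorted(history, key=lambda row: euclidean(stats, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


player_a = (35, 3, 10)
print(euclidean(player_a, (34, 3, 10)))  # the very similar player
print(euclidean(player_a, (40, 3, 15)))  # the less similar player
print(knn_predict(player_a, history, k=3))
```

The 34 disposal player sits at a much smaller distance from Player A than the 40 disposal player, which is exactly what "more similar" means to the model. Note that raw stats on very different scales (disposals versus goals) would normally be scaled first so that one stat does not dominate the distance.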

A more detailed description of KNN can be found here.

Running The Model

Before getting too carried away, I like to run a baseline model to iron out any kinks in the data structures and get a feel for how the model behaves. When using a library like scikit-learn, running models is cheap and easy. We’ll be able to run lots of tests on lots of models to see what works and what does not. Once we have that information, it becomes easier to brainstorm ways to further increase predictive power and overcome inherent limitations of the problem we are solving.

Note: I’m going to leverage some helper methods from an earlier post as we go along.

# Models/knn.py

# scikit-learn models
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Helper methods we created earlier
from PreProcessing.load_data import all_data_without_contested_possession_breakdown
from Helpers.visualisations import plot_confusion_matrix


def knn(neighbours):
    """
        Basic implementation of KNN algorithm to get a baseline confusion matrix for brownlow predictions.
    """
    # Load the raw training data from our helper method
    features, targets = all_data_without_contested_possession_breakdown()
    
    # Utilise sklearn's helper to split data into training/validation sets
    X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.3, random_state=0)
    
    # Initialise the KNN Classifier Model
    model = KNeighborsClassifier(n_neighbors=neighbours)

    # Train the model on the training data
    model.fit(X_train, y_train)

    # Get Predictions for the validation set
    predictions = model.predict(X_test)

    # Generate Confusion matrix for trained model based on validation performance
    plot_confusion_matrix(y_test, predictions, f'KNN - No sampling, {neighbours} Neighbours')

    # Return the model so we can do stuff with it, along with some accuracy and precision metrics
    return model, classification_report(y_test, predictions)

The code snippet above shows just how easy it is to train a model using sklearn packages in Python. Once we had our data separated into feature sets (match stats) and target variables (Brownlow votes), all we had to do was create a new instance of our model and call the fit() method, passing it our training data.

To run this model, we just need to create an app.py with the following code:

from Models.knn import knn

if __name__ == "__main__":
    knn_model, metrics = knn(neighbours=10)
    print(metrics)

Then, from our command line, we can just run:

python app.py

In a few lines of code we’ve trained a classification model and can use it to predict Brownlow votes from any new data (e.g. 2019 Match Data). Before we get too carried away with our genius, let’s have a look at how the model performed.

Preliminary Results

Confusion Matrix

The confusion matrix for the base run is shown below. There are some interesting things we can glean from this. Firstly though, let’s go over how to read a confusion matrix.

This matrix has been normalised, which means it shows the proportion of classifications rather than the total count. A value of 1.0 means 100%, 0.42 means 42%, and so on.

Example Confusion Matrix

Reading the confusion matrix is first done by reading the row (Y axis) and then the column (X axis) and then finally looking at the number in that cell. In essence the confusion matrix expresses results in the form of:

For a true label of X, a label of Y was predicted Z % of the time.

For example, the top-left square is read as:

For a true label of 0, a label of 0 was predicted 100% of the time.

This means that every player in the validation set who actually got 0 votes was predicted to get 0 votes. In other words, the model never predicted a single vote for a player who didn’t actually receive one.

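Now that we understand how to read a confusion matrix, it should be apparent that every row must sum to 1.0. Each player with a given true label is predicted as exactly one of 0, 1, 2 or 3, so the proportions across a row have to add up to 100%. If 100% of our 0 vote players are predicted to get 0 votes, there is nothing left over for 0 vote players predicted to get 1, 2 or 3 votes.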
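As a rough sketch of what "normalised" means here, the snippet below builds a row-normalised confusion matrix by hand from two lists of labels. The labels are made up for illustration, and `normalised_confusion_matrix` is a hypothetical helper, not the `plot_confusion_matrix` function used in this post.

```python
from collections import Counter


def normalised_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, each row sums to 1.0."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for true_label in labels:
        row_total = sum(counts[(true_label, pred_label)] for pred_label in labels)
        row = [
            counts[(true_label, pred_label)] / row_total if row_total else 0.0
            for pred_label in labels
        ]
        matrix.append(row)
    return matrix


# Made-up validation labels, not real model output.
labels = [0, 1, 2, 3]
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 0, 3]

matrix = normalised_confusion_matrix(y_true, y_pred, labels)
for true_label, row in zip(labels, matrix):
    print(true_label, row, sum(row))
```

Each printed row is the breakdown of predictions for one true label, which is why the row totals always come out at 1.0.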

While we might think that 100% accuracy on 0 vote games is good, let’s move on to the other labels (those that represent players who actually got votes).

We can see that players who polled 1 vote in the validation data were predicted to get no votes 87% of the time! Similarly, 78% of players who actually polled 2 votes were predicted to get none, and 62% of players who actually got 3 votes were predicted to get none! This model seems kind of useless as it stands at the moment. Interestingly, 26% of players who got 3 votes are actually being predicted to get 3, which isn’t too bad relative to the rest of the model.

The confusion matrix is showing us that most players are predicted to get 0 votes, regardless of how many votes they actually got. There is a very simple explanation for this!

Most players in a game of AFL will not get any Brownlow votes. In fact only 3 players across both teams will get any. This means that when our model was trained, most of the data it saw was for 0 vote players. That means that the model could be fairly accurate just by blanket estimating everyone to get 0 votes.
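A quick way to see why blanket 0 predictions look accurate is to score a "model" that always predicts the most common label. This is only a sketch with made-up labels (94 zeros out of 100 player-games), not the real dataset.

```python
from collections import Counter

# Made-up vote labels for 100 player-games: mostly 0s, a handful of 1s, 2s and 3s.
targets = [0] * 94 + [1] * 2 + [2] * 2 + [3] * 2

# The "model": always predict whichever label is most common.
majority_label, majority_count = Counter(targets).most_common(1)[0]
predictions = [majority_label] * len(targets)

# Overall accuracy is simply the share of the majority label.
accuracy = sum(t == p for t, p in zip(targets, predictions)) / len(targets)
print(majority_label, accuracy)
```

Any real model has to beat this baseline before its overall accuracy means much, which is why the per-label numbers are more informative for this problem.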

Accuracy and Precision

Per-label accuracy (how many of the actual Xs were labelled as X, which scikit-learn's classification_report calls recall) and precision (how many of those labelled as X were actually X) are useful for quickly analysing the results too.

Here’s the accuracy/precision of each label for this model:

0: 1.00 / 0.95 (All 0s were predicted as 0s, but only 95% of those predicted as 0 actually were 0)
1: 0.02 / 0.18 (2% of 1s were predicted as 1s, 18% of those predicted as 1s were actually a 1)
2: 0.07 / 0.27 (7% of 2s were predicted as 2s, 27% of those predicted as 2s were actually a 2)
3: 0.28 / 0.52 (28% of 3s were predicted as 3s, 52% of those predicted as 3s were actually a 3)
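For anyone who wants to see where these two numbers come from, here is a small hand-rolled sketch. What this post calls per-label accuracy is what scikit-learn's classification_report labels as recall. The labels below are made up, and `precision_recall` is a hypothetical helper rather than a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for a single label."""
    true_positive = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    actual = sum(1 for t in y_true if t == label)      # how many really were this label
    predicted = sum(1 for p in y_pred if p == label)   # how many we labelled as this
    recall = true_positive / actual if actual else 0.0
    precision = true_positive / predicted if predicted else 0.0
    return precision, recall


# Made-up labels: six 0 vote games and four 3 vote games.
y_true = [0, 0, 0, 0, 0, 0, 3, 3, 3, 3]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

for label in (0, 3):
    precision, recall = precision_recall(y_true, y_pred, label)
    print(f"label {label}: recall={recall:.2f} precision={precision:.2f}")
```

In this toy case every 0 is found (recall 1.0) but some of the predicted 0s were really 3s, so precision for 0 drops, which is the same pattern as the real results above.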

Given that 94.3% of the data is labelled as a 0 vote data point, the actual accuracy of this model is not that good: simply predicting 0 for everyone would already be right 94.3% of the time. It predicts some obvious 3 vote games but mostly assumes people won’t poll.

Tweaking the Model

We can make a small tweak here that will improve model performance dramatically. Let’s remove 90% of our 0 vote data points from the training set. This should mean there are roughly as many data points where players got 0 votes as data points where they got 1, 2 or 3.

We’ll make a quick change to the code to allow us to pass a sample_probability parameter for 0 vote players in the training set.

# Models/knn.py
# import statements ......

def knn(neighbours, sample_probability=None):
    """
        Basic implementation of KNN algorithm to get a baseline confusion matrix for brownlow predictions.
    """
    # Load the raw training data
    features, targets = all_data_without_contested_possession_breakdown(sample_probability=sample_probability)
    
    # rest of the code .......

Then a quick change to app.py:

from Models.knn import knn

if __name__ == "__main__":
    knn_model, metrics = knn(neighbours=10, sample_probability=10)
    print(metrics)
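The sampling itself happens inside the data loading helper, which isn't shown in this post. As a rough sketch of the idea only, the snippet below assumes that a value of 10 means "keep roughly 10% of the 0 vote rows"; that interpretation, the function name `downsample_zero_votes` and the toy data are all assumptions for illustration.

```python
import random


def downsample_zero_votes(features, targets, keep_percent, seed=0):
    """Keep every row with votes, and each 0 vote row with keep_percent% probability."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    kept_features, kept_targets = [], []
    for row, votes in zip(features, targets):
        if votes != 0 or rng.random() * 100 < keep_percent:
            kept_features.append(row)
            kept_targets.append(votes)
    return kept_features, kept_targets


# Made-up data: 940 zero vote rows and 20 rows each of 1, 2 and 3 votes.
targets = [0] * 940 + [1] * 20 + [2] * 20 + [3] * 20
features = [[i] for i in range(len(targets))]  # placeholder stat rows

sampled_features, sampled_targets = downsample_zero_votes(features, targets, keep_percent=10)
zeros_kept = sampled_targets.count(0)
votes_kept = len(sampled_targets) - zeros_kept
print(zeros_kept, votes_kept)
```

After sampling, the 0 vote rows are on the same order of magnitude as the rows with votes, instead of outnumbering them by more than fifteen to one. If the goal is a fair read on performance, the sampling should only touch the training rows and leave the validation set with its natural class balance.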

Sampling Confusion Matrix

As we can see, the performance of this model has already improved dramatically.

TBC