Brownlow Modelling - Preparation
This is the first post in a series on introductory algorithms used to power predictive models in data science. I’ve chosen AFL match data as my data set to provide an approachable frame of reference for those familiar with the sport. I’ll be attempting to create a model that predicts which players will get Brownlow votes in any given game of AFL football.
Predicting which players will receive votes is not easy (citation needed). However, I believe the nature of the problem (discussed in detail in a later post) makes it a very interesting case study for an introductory look at how I go about solving real-world problems.
The data and code used throughout this post can be found here on my GitHub.
Analysing the Data
As I’m the one who collected the data, I’m fairly confident that it is as accurate as can be. There are two key things to note if you are looking at the data yourself.
Firstly, the 2012 data set does not split player disposals into Contested Possessions (CP) and Uncontested Possessions (UP); only the total number of disposals is recorded. This means that if we want to use the CP/UP breakdown we will need to exclude the 2012 data. Alternatively, we can keep the 2012 data if we decide to ignore the disposal breakdown for the later years.
Secondly, in the years 2012-2017 the target variable (Brownlow Votes) is explicitly marked as 0, 1, 2 or 3. In 2018 I have only labelled 1, 2 and 3 and left the 0-vote targets blank. This is trivial to fix in the source data, but it can also be easily handled in code (which is the method taken here).
Normally, I would write some tests to validate the data set. Some example tests would be:
- Check that every game has 6 votes (3+2+1)
- Check that every game has exactly one player on each of 3, 2 and 1 votes
- Plot the range of values for each feature in the data set to look for obvious outliers
- Check that historical totals for players add up (e.g. in Year X, Player Y got Z votes in total)
In this instance I’ve already run these checks and we won’t need to re-run them for every model, so covering this now is a distraction.
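For readers who want to try the first two checks on their own copy of the data, a minimal sketch might look like the following. The function name `check_votes_per_game` and the toy data are made up for illustration, and the column names `Unique_Game_ID` and `Brownlow_Votes` are assumed to match the ones used in the loading code in this post.

```python
import pandas as pd


def check_votes_per_game(df, game_col="Unique_Game_ID", votes_col="Brownlow_Votes"):
    """Return the IDs of games whose votes are not exactly one 3, one 2 and one 1."""
    # Blank cells (as in the 2018 data) are treated as 0 votes
    votes = df[votes_col].fillna(0).astype(int)
    bad_games = []
    for game_id, game_votes in votes.groupby(df[game_col]):
        polled = sorted(v for v in game_votes.tolist() if v > 0)
        # [1, 2, 3] also implies the votes for the game sum to 6
        if polled != [1, 2, 3]:
            bad_games.append(game_id)
    return bad_games


# Made-up example: game "A" is valid, game "B" is missing its 2-vote player
toy = pd.DataFrame({
    "Unique_Game_ID": ["A", "A", "A", "A", "B", "B", "B"],
    "Brownlow_Votes": [3, 2, 1, None, 3, 1, 0],
})
bad = check_votes_per_game(toy)
print(bad)
```

An empty list means every game passed both checks.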
Usually when I’m tackling a problem like this for the first time, I have no idea what models will perform well, what level of pre-processing is optimal, or whether I want to sample training data (if it is heavily skewed - as it is here). For this reason I like to write some helper methods for myself that make it easy to explore all the different combinations and strategies without having to repeat myself.
pandas DataFrames are invaluable when processing structured data like this. Reading from a csv is a breeze and we can do really fast filtering and operations on our data. The first thing I’m going to do is write a function that imports our csv into a pandas DataFrame, removes any data we don’t need, fills in any missing values, and generates any new features we may want in our data set.
At this stage I want to add a Winning_Margin feature to my data set: while I have the home and away team scores, the winning margin is not stored at the player level. There are also several columns without any predictive value that I’m going to drop for now, such as unique game identifiers, team names and player names. I’ll also write a function that drops the CP/UP features from the entire data set. Later on we’ll write another function that uses the CP/UP breakdown but drops the 2012 data entirely.
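As an aside, the margin feature can also be computed for the whole table at once rather than row by row. This is a sketch of that alternative, not the approach used in this post's code: the helper name `add_winning_margin` and the toy scores are invented, and the column names are assumed to match the csv.

```python
import numpy as np
import pandas as pd


def add_winning_margin(df):
    """Add a Winning_Margin column from each player's own team's point of view."""
    home_margin = df["Home_Score"] - df["Away_Score"]
    is_home = df["Team"] == df["Home_Team"]
    out = df.copy()
    # Home players get the home margin, away players get its negative
    out["Winning_Margin"] = np.where(is_home, home_margin, -home_margin)
    return out


# Made-up game: the home side wins 100 to 80
toy = pd.DataFrame({
    "Team": ["Cats", "Swans"],
    "Home_Team": ["Cats", "Cats"],
    "Away_Team": ["Swans", "Swans"],
    "Home_Score": [100, 100],
    "Away_Score": [80, 80],
})
margins = add_winning_margin(toy)["Winning_Margin"].tolist()
print(margins)  # [20, -20]
```

A negative margin simply means the player's team lost by that many points.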
The third thing I know I’ll be doing a lot is experimenting with whether or not to randomly sample data for players who polled no votes. If 44+ players take the field each match and only 3 of them receive Brownlow votes, any model that predicts 0 votes for every player will have a prediction accuracy of > 93%. Some algorithms perform poorly when data sets are unbalanced like that, so we need a way to quickly return only a sample of the players who got 0 votes when we require it.
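The 93% figure is simple arithmetic, sketched here using 44 players per game as the lower bound mentioned above (the variable names are just for illustration).

```python
players_per_game = 44  # lower bound; the real number can be higher
vote_getters = 3       # one player each on 3, 2 and 1 votes

# A model that always predicts 0 votes is right for everyone except the vote-getters
baseline_accuracy = (players_per_game - vote_getters) / players_per_game
print(f"{baseline_accuracy:.1%}")  # 93.2%
```

With more than 44 players in a game the baseline only gets higher, which is why raw accuracy is a poor way to judge these models.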
    # PreProcessing/load_data.py
    import random

    import pandas as pd


    def calculate_margin(row):
        '''
        Calculates margin for the match and assigns it to the player
        based on which team they are on
        '''
        if row['Team'] == row['Home_Team']:
            return row['Home_Score'] - row['Away_Score']
        else:
            return row['Away_Score'] - row['Home_Score']


    def sample_no_votes(row, sample_probability=10):
        '''
        The number of players receiving 0 votes vastly outnumbers the number
        getting 1, 2 or 3. This means that models can be fairly accurate
        predicting 0 votes for every player. Some models perform poorly if one
        label outweighs the others by too much.

        Returns True if the row should be kept. sample_probability is a
        percentage (10 keeps roughly 10% of the 0-vote rows).
        '''
        if row['Brownlow_Votes'] > 0:
            return True
        # 0-vote rows (including blank cells) are kept at random
        return random.randint(0, 99) < sample_probability


    def all_data_without_contested_possession_breakdown(sample_probability=None):
        # Read csv into Pandas DataFrame
        df = pd.read_csv("Brownlow Full Database.csv")

        # removing some empty cells at the end of the data-frame that seem to be
        # caused by the csv. Should clean up data source at some point
        df = df.iloc[:, :-4]

        # Add a column with the winning margin for the player
        df['Winning_Margin'] = df.apply(calculate_margin, axis=1)

        # Remove columns that we won't use as features in the training data
        df = df.drop([
            'Unique_Game_ID',
            'Year',
            'Round',
            'Game_ID',
            'Home_Team',
            'Home_Score',
            'Away_Team',
            'Away_Score',
            'Player',
            'Team',
            'Contested_Possessions',    # values not recorded in 2012 data
            'Uncontested_Possessions',  # values not recorded in 2012 data
        ], axis=1)

        # Check if we want to randomly sample 0 vote players, if so perform a random sample
        if sample_probability is not None:
            df['keep'] = df.apply(lambda x: sample_no_votes(x, sample_probability), axis=1)
            # Keep only the rows flagged True, then drop the helper column
            df = df[df['keep']]
            df = df.drop(['keep'], axis=1)

        # Drop the target variable from our feature set and return an array of values
        features = df.drop([
            'Brownlow_Votes'  # this is the target
        ], axis=1).values

        # Extract the Brownlow Votes to an array of values, filling in all empty cells with a 0
        targets = df['Brownlow_Votes'].fillna(0).values  # in 2018 data 0 votes are just blank cells

        return features, targets
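The same down-sampling idea can be written without a row-by-row `apply`, which also makes it easy to fix the random seed so that runs are repeatable. This is only a sketch of an alternative: `downsample_no_votes`, `keep_fraction` and `seed` are hypothetical names, and `keep_fraction` is a fraction between 0 and 1 rather than the percentage used by `sample_probability`.

```python
import numpy as np
import pandas as pd


def downsample_no_votes(df, keep_fraction, seed=None, votes_col="Brownlow_Votes"):
    """Keep every row with votes and roughly `keep_fraction` of the 0-vote rows."""
    rng = np.random.default_rng(seed)
    polled = df[votes_col].fillna(0) > 0         # blank cells count as 0 votes
    lucky = rng.random(len(df)) < keep_fraction  # one uniform draw per row
    return df[polled | lucky]


# Made-up data: 3 vote-getters and 97 players with blank (0-vote) cells
toy = pd.DataFrame({"Brownlow_Votes": [3, 2, 1] + [None] * 97})
sampled = downsample_no_votes(toy, keep_fraction=0.1, seed=42)
print(len(sampled))  # the 3 vote-getters plus roughly 10 of the other 97
```

Passing the same `seed` gives the same sample each time, which helps when comparing models against each other.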
We’re going to be analysing a lot of different models, and for each model we are also going to be trying numerous combinations of parameters and pre-processing to hunt for the “best performing” model. I find the easiest way to quickly assess model performance for these types of problems is with a Confusion Matrix.
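For anyone who hasn't met one before, a confusion matrix is just a table of counts: each row is a true label, each column a predicted label. The sketch below builds one by hand on made-up labels to show the idea; `confusion_counts` is an illustrative helper, not a library function.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes=4):
    """Count predictions: rows are the true label, columns the predicted label."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


# Made-up labels: the "model" mostly predicts 0 votes
y_true = [0, 0, 0, 0, 1, 2, 3]
y_pred = [0, 0, 0, 1, 0, 2, 0]

cm = confusion_counts(y_true, y_pred)
# Divide each row by its total so it shows the share of each true label
normalised = cm / cm.sum(axis=1, keepdims=True)
print(cm)
print(normalised)
```

A perfect model would put all of its counts on the diagonal. Note that the row normalisation assumes every class appears at least once in `y_true`.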
Here is a Confusion Matrix I generated for a KNNClassifier model on the Brownlow data (we’ll explore this in more detail in my next post). Note: if you look closely you’ll see that the model is predicting 0 for most players in the test set.
I can generate a nice graph like this for all our models by extracting it out to a function that we can call whenever we want to generate one instantly.
    # Helpers/visualisations.py
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix

    plt.style.use('ggplot')


    def plot_confusion_matrix(y_true, y_pred, title=None, cmap=plt.cm.Blues):
        """
        This function prints and plots the confusion matrix,
        normalized so that each row (true label) sums to 1.
        """
        classes = ["0", "1", "2", "3"]
        if not title:
            title = 'Normalized confusion matrix'

        # Compute confusion matrix, then normalize each row
        cm = confusion_matrix(y_true, y_pred)
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
        print(cm)

        fig, ax = plt.subplots()
        im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
        ax.figure.colorbar(im, ax=ax)
        # We want to show all ticks...
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               # ... and label them with the respective list entries
               xticklabels=classes, yticklabels=classes,
               title=title,
               ylabel='True label',
               xlabel='Predicted label')

        # Rotate the tick labels and set their alignment.
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
                 rotation_mode="anchor")

        # Loop over data dimensions and create text annotations.
        fmt = '.2f'
        thresh = cm.max() / 2.
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                ax.text(j, i, format(cm[i, j], fmt),
                        ha="center", va="center",
                        color="white" if cm[i, j] > thresh else "black")
        fig.tight_layout()
        plt.show()
I think that’s a good starting point for creating some functions that will allow us to quickly analyse various models.