
Introduction
Regardless of who they root for, most NBA fans can agree that officiating is far from perfect. A seemingly winnable game can suddenly slip out of reach after one or two bad calls. Early playoff exits and heartbreaking losses are often blamed on biased referees and what sour fanbases like to call "the script"; just look at any social media platform after a close game.
With all of that said, do NBA referees really deserve the criticism they receive? Are things actually rigged for certain teams? The main focus of this project is to answer some of these questions with data from previous seasons. This tutorial will walk you through the entire data science pipeline, from data collection to machine learning, and by the end you should have a better understanding of the factors that affect the accuracy of NBA referees. If we can use a model to predict incorrect calls based on the primary official of a game, we may even be able to hold specific refs accountable or identify who is fit for the job.
The tutorial will be separated into five main sections:
- Data Collection
- Data Processing
- Exploratory Analysis and Data Visualization
- Model Analysis, Hypothesis Testing, Machine Learning
- Insights
Data Collection
Before we can start exploring our topic, we need to find a dataset to work with. Since we want information related to foul calls, it makes the most sense to check the official NBA website for public data first. Luckily, the NBA has been publishing Last Two Minute Reports for every game since the 2015-2016 season. An archive of these reports can be found here. As you may have guessed, Last Two Minute Reports assess the correctness of foul calls made within the last two minutes of a game and provide additional details such as the time left on the clock, the player who committed the foul, the player who was fouled, the type of foul called (shooting, personal, loose ball, technical, etc.), and the review decision. There are four kinds of review decisions: CC = correct call, IC = incorrect call, CNC = correct non-call, and INC = incorrect non-call (a non-call is a play where the referees did not blow the whistle). For our purposes, an incorrect call is anything classified as an IC or INC. Some fans complain that the reports are useless because they do nothing to change game outcomes, but for data scientists, they are great! Unfortunately, the NBA does not provide the reports in .csv or spreadsheet format, so we will have to deal with that somehow.

Now, we could build a web scraper and extract all of this data ourselves, or we could look around for existing datasets. With a bit of digging, we were able to find this handy website, which has already scraped foul calls from every Last Two Minute Report from March 2015 to May 2024. Go ahead and download the .csv file available on the GitHub page. At this point, we can import the necessary Python libraries and start coding!
Imports
# Import the necessary libraries
# NOTE: you may need to install the libraries first via "!pip install"
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import statsmodels.formula.api as sm
import os
import datetime
from unicodedata import normalize
from io import StringIO
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, roc_curve, roc_auc_score
If you are wondering what these libraries are for, here is a brief overview of the most important ones:
- Pandas is a popular open-source tool for organizing and working with data in Python. The main data structure in Pandas is the dataframe, which you can think of as a simple table of rows and columns (see the short example below). Because they are easy to manipulate and understand, dataframes will be our structure of choice for the rest of the tutorial.
- NumPy is an extensive math library that we can use to perform fast operations on arrays. SciPy is similar to NumPy with a greater emphasis on scientific computing.
- Matplotlib is a plotting library that we will use to create static visualizations like line graphs and histograms. Seaborn builds on Matplotlib and provides a higher-level interface for statistical plots.
- The statsmodels.formula.api and scikit-learn libraries are useful for running and evaluating statistical models and machine learning algorithms.
Also, if you are not familiar with Jupyter Notebook, you can read about the platform here.
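To get a feel for the dataframe structure before loading the real data, here is a tiny sketch with made-up values (two rows, two columns):
# A tiny dataframe with made-up values, just to show the rows-and-columns structure
example_df = pd.DataFrame({
    "team": ["Knicks", "Lakers"],
    "fouls_called": [3, 5]
})
print(example_df)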
Get Foul Data from 2015-2024
# Load the csv into a pandas dataframe
foul_df = pd.read_csv("foul_data_2015_2024.csv", low_memory=False)
foul_df.head()
period | time | call_type | committing | disadvantaged | decision | comments | game_details | page | file | ... | committing_min | committing_team | committing_side | disadvantaged_min | disadvantaged_team | disadvantaged_side | type2 | time_min | time_sec | time2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q4 | 01:52.0 | Foul: Shooting | Josh Smith | Kevin Love | CNC | Smith (HOU) does not make contact with Love (C... | Cavaliers @ Rockets (Mar 01, 2015) | 1.0 | L2M-CLE-HOU-3-1-15.pdf | ... | 30.950000 | HOU | home | 35.016667 | CLE | away | SHOOTING | 1 | 52.0 | 1.866667 |
1 | Q4 | 01:43.0 | Foul: Shooting | JR Smith | James Harden | CC | Smith (CLE) makes contact with the body of Har... | Cavaliers @ Rockets (Mar 01, 2015) | 1.0 | L2M-CLE-HOU-3-1-15.pdf | ... | 50.800000 | CLE | away | 42.350000 | HOU | home | SHOOTING | 1 | 43.0 | 1.716667 |
2 | Q4 | 01:32.0 | Foul: Shooting | Trevor Ariza | LeBron James | CC | Ariza (HOU) makes contact with the shoulder of... | Cavaliers @ Rockets (Mar 01, 2015) | 1.0 | L2M-CLE-HOU-3-1-15.pdf | ... | 42.233333 | HOU | home | 42.100000 | CLE | away | SHOOTING | 1 | 32.0 | 1.533333 |
3 | Q4 | 01:09.0 | Foul: Loose Ball | Terrence Jones | Tristan Thompson | CC | Jones (HOU) makes contact with the arm of Thom... | Cavaliers @ Rockets (Mar 01, 2015) | 1.0 | L2M-CLE-HOU-3-1-15.pdf | ... | 44.016667 | HOU | home | 37.316667 | CLE | away | LOOSE BALL | 1 | 9.0 | 1.150000 |
4 | Q4 | 00:53.0 | Foul: Shooting | Tristan Thompson | Josh Smith | CNC | Smith (HOU) loses the ball as he goes up for t... | Cavaliers @ Rockets (Mar 01, 2015) | 1.0 | L2M-CLE-HOU-3-1-15.pdf | ... | 37.316667 | CLE | away | 30.950000 | HOU | home | SHOOTING | 0 | 53.0 | 0.883333 |
5 rows × 64 columns
After using the Pandas read_csv() function to create a dataframe from the foul dataset (in my case the file is named foul_data_2015_2024.csv, but you can name it whatever you like), we can take a quick look at the data with the head() function. We will get into what each column represents in the next section. Every row represents a single foul call.
Data Processing
Cleaning the Foul Data
# Create a new dataframe with selected columns
cols = [
"date", "time", "season", "home_team", "away_team",
"type", "committing", "disadvantaged", "decision",
"committing_team", "disadvantaged_team", "OFFICIAL_1"
]
foul_df_clean = foul_df[cols].copy()
# Rename columns
foul_df_clean = foul_df_clean.rename(columns={
"home_team": "home",
"away_team": "away",
"committing": "player",
"disadvantaged": "opponent",
"committing_team": "player team",
"disadvantaged_team": "opponent team",
"OFFICIAL_1": "official"
})
# Map team abbreviations to names
all_teams = {
"ATL": "Hawks", "BOS": "Celtics", "BKN": "Nets", "CHA": "Hornets", "CHI": "Bulls",
"CLE": "Cavaliers", "DAL": "Mavericks", "DEN": "Nuggets", "DET": "Pistons", "GSW": "Warriors",
"HOU": "Rockets", "IND": "Pacers", "LAC": "Clippers", "LAL": "Lakers", "MEM": "Grizzlies",
"MIA": "Heat", "MIL": "Bucks", "MIN": "Timberwolves", "NOP": "Pelicans", "NYK": "Knicks",
"OKC": "Thunder", "ORL": "Magic", "PHI": "76ers", "PHX": "Suns", "POR": "Trail Blazers",
"SAC": "Kings", "SAS": "Spurs", "TOR": "Raptors", "TOT": "Total", "UTA": "Jazz", "WAS": "Wizards"
}
# Map team abbreviations to their names for the player and opponent team columns
foul_df_clean["player team"] = foul_df_clean["player team"].map(all_teams)
foul_df_clean["opponent team"] = foul_df_clean["opponent team"].map(all_teams)
foul_df_clean.head()
date | time | season | home | away | type | player | opponent | decision | player team | opponent team | official | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-03-01 | 01:52.0 | 2015 | Rockets | Cavaliers | SHOOTING | Josh Smith | Kevin Love | CNC | Rockets | Cavaliers | Dan Crawford |
1 | 2015-03-01 | 01:43.0 | 2015 | Rockets | Cavaliers | SHOOTING | JR Smith | James Harden | CC | Cavaliers | Rockets | Dan Crawford |
2 | 2015-03-01 | 01:32.0 | 2015 | Rockets | Cavaliers | SHOOTING | Trevor Ariza | LeBron James | CC | Rockets | Cavaliers | Dan Crawford |
3 | 2015-03-01 | 01:09.0 | 2015 | Rockets | Cavaliers | LOOSE BALL | Terrence Jones | Tristan Thompson | CC | Rockets | Cavaliers | Dan Crawford |
4 | 2015-03-01 | 00:53.0 | 2015 | Rockets | Cavaliers | SHOOTING | Tristan Thompson | Josh Smith | CNC | Cavaliers | Rockets | Dan Crawford |
While it is nice that the dataset has many different features to choose from, we only need a select few for analysis and visualization. Of the 64 available columns, we will use the following:
- Date = the date of the game.
- Time = the amount of time left on the clock when the foul was called, in MINUTES:SECONDS.MILLISECONDS format.
- Season = the year.
- Home team = the name of the home team.
- Away team = the name of the away team.
- Type = the type of foul called.
- Committing = the name of the player who committed the foul.
- Disadvantaged = the name of the player who was fouled.
- Decision = the review decision for the foul, CC, CNC, IC, or INC.
- Committing team = the team of the player who committed the foul.
- Disadvantaged team = the team of the player who got fouled.
- Official 1 = the primary referee of the game in which the foul was called.
In addition to filtering down to these columns, we will rename some of them so they are a bit easier to work with and understand. Lastly, we will use the map() function to change the player and opponent team columns so they match the home and away team columns. If you look at the player/opponent team and home/away team columns before the modification, you will notice that their values do not follow the same format: the player/opponent teams use abbreviations instead of full names ("NYK" instead of "Knicks", for example).
If the review decision is missing (NaN) for a row, we will drop that row altogether. After cleaning up the dataframe, displaying foul_df_clean should look like the block above.
Remove NaN Observations and Add a New Column to Assess Correct and Incorrect Calls
# Remove rows where the decision column is NaN
foul_df_clean = foul_df_clean[~foul_df_clean["decision"].isna()].reset_index(drop=True)
def correct_decision(decision):
    if decision in ["CC", "CNC"]:
        return 1
    else:
        return 0
# Create a column that serves as a binary indicator for whether the referees acted correctly or incorrectly
foul_df_clean["correct decision"] = foul_df_clean["decision"].apply(correct_decision)
foul_df_clean.head()
date | time | season | home | away | type | player | opponent | decision | player team | opponent team | official | correct decision | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-03-01 | 01:52.0 | 2015 | Rockets | Cavaliers | SHOOTING | Josh Smith | Kevin Love | CNC | Rockets | Cavaliers | Dan Crawford | 1 |
1 | 2015-03-01 | 01:43.0 | 2015 | Rockets | Cavaliers | SHOOTING | JR Smith | James Harden | CC | Cavaliers | Rockets | Dan Crawford | 1 |
2 | 2015-03-01 | 01:32.0 | 2015 | Rockets | Cavaliers | SHOOTING | Trevor Ariza | LeBron James | CC | Rockets | Cavaliers | Dan Crawford | 1 |
3 | 2015-03-01 | 01:09.0 | 2015 | Rockets | Cavaliers | LOOSE BALL | Terrence Jones | Tristan Thompson | CC | Rockets | Cavaliers | Dan Crawford | 1 |
4 | 2015-03-01 | 00:53.0 | 2015 | Rockets | Cavaliers | SHOOTING | Tristan Thompson | Josh Smith | CNC | Cavaliers | Rockets | Dan Crawford | 1 |
We previously mentioned that incorrect calls include both IC and INC decisions. To make things easier later on, we will create another column, "correct decision", to numerically indicate the correctness of each foul call. Since a foul call can only be wrong or right, we will use ones for correct calls (when decision == CNC or CC) and zeros for incorrect calls (when decision == INC or IC).
In the next code block we will also use the time column to create a new column containing the seconds remaining in a game.
Create a New Column for Seconds Remaining
# Splitting time into minutes and seconds
foul_df_clean[["min", "sec"]] = foul_df_clean["time"].str.split(":", expand=True)
foul_df_clean["min"] = foul_df_clean["min"].astype(int)
foul_df_clean["sec"] = foul_df_clean["sec"].astype(float)
# Calculate total time in seconds
foul_df_clean["seconds"] = foul_df_clean["min"] * 60 + foul_df_clean["sec"]
foul_df_clean.head()
date | time | season | home | away | type | player | opponent | decision | player team | opponent team | official | correct decision | min | sec | seconds | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-03-01 | 01:52.0 | 2015 | Rockets | Cavaliers | SHOOTING | Josh Smith | Kevin Love | CNC | Rockets | Cavaliers | Dan Crawford | 1 | 1 | 52.0 | 112.0 |
1 | 2015-03-01 | 01:43.0 | 2015 | Rockets | Cavaliers | SHOOTING | JR Smith | James Harden | CC | Cavaliers | Rockets | Dan Crawford | 1 | 1 | 43.0 | 103.0 |
2 | 2015-03-01 | 01:32.0 | 2015 | Rockets | Cavaliers | SHOOTING | Trevor Ariza | LeBron James | CC | Rockets | Cavaliers | Dan Crawford | 1 | 1 | 32.0 | 92.0 |
3 | 2015-03-01 | 01:09.0 | 2015 | Rockets | Cavaliers | LOOSE BALL | Terrence Jones | Tristan Thompson | CC | Rockets | Cavaliers | Dan Crawford | 1 | 1 | 9.0 | 69.0 |
4 | 2015-03-01 | 00:53.0 | 2015 | Rockets | Cavaliers | SHOOTING | Tristan Thompson | Josh Smith | CNC | Cavaliers | Rockets | Dan Crawford | 1 | 0 | 53.0 | 53.0 |
Exploratory Analysis and Data Visualization
In this section, we are going to create a few different plots with our cleaned dataset and try to find some possible relationships between the variables. Feel free to recreate this part and try different graphs; the point of this stage is to experiment a bit!
For the first graph, let's see how referees have performed over time by plotting the average foul call accuracy for each season. After grouping rows together by season using groupby(), the mean() function will allow us to quickly calculate the average value of the "correct decision" column for each group. This is basically the same as finding: $$\frac{\textrm{\# of correct decisions}}{\textrm{total decisions}}$$
After plotting with Matplotlib, you will notice that referees have *generally* gotten better over time. We will keep this in mind moving forward.
Plot Accuracy of Foul Calls Over Time
# Group by season and calculate the mean accuracy of decisions for each season
season_accuracy = foul_df_clean.groupby("season")["correct decision"].mean()
# Plotting
plt.figure(figsize=(10, 6))
season_accuracy.plot(marker="o", linestyle="-")
plt.title("Accuracy of Foul Calls Over Seasons")
plt.xlabel("Season")
plt.ylabel("Mean Call Accuracy")
plt.xticks(season_accuracy.index)
plt.grid(True)
plt.tight_layout()
plt.show()
To get a better idea of how biased (or unbiased) referees are, let's count the number of incorrect calls against each team. A foul call puts a team at a disadvantage when the decision is incorrect and the player charged with the foul is on that team. In the following code block, we will use nested Python dictionaries to map team names to the number of incorrect calls against them for every season.
Count the Number of Incorrect Calls Against Each Team for Each Season
# Get the number of unique team names
teams = foul_df_clean["home"].unique()
# Dictionary to store the results
team_season_incorrect_counts = {}
for team in teams:
    # Initialize a dictionary for the current team
    team_season_incorrect_counts[team] = {}
    for season in foul_df_clean["season"].unique():
        # Filter data for the current team and season
        team_season_data = foul_df_clean[(foul_df_clean["player team"] == team) & (foul_df_clean["season"] == season)]
        # Count observations where correct decision is 0 (incorrect)
        count_incorrect = len(team_season_data[(team_season_data["correct decision"] == 0)])
        # Store the count for the current season and move onto the next season
        team_season_incorrect_counts[team][season] = count_incorrect
# Preview the counts for the first four teams
counter = 0
for team, season_counts in team_season_incorrect_counts.items():
    counter += 1
    if counter == 5:
        break
    print(f"Team: {team}")
    for season, count in season_counts.items():
        print(f"{season} Season: {count} incorrect calls")
Team: Rockets
2015 Season: 4 incorrect calls
2016 Season: 45 incorrect calls
2017 Season: 26 incorrect calls
2018 Season: 12 incorrect calls
2019 Season: 43 incorrect calls
2020 Season: 10 incorrect calls
2021 Season: 7 incorrect calls
2022 Season: 10 incorrect calls
2023 Season: 14 incorrect calls
2024 Season: 25 incorrect calls
Team: Celtics
2015 Season: 7 incorrect calls
2016 Season: 13 incorrect calls
2017 Season: 28 incorrect calls
2018 Season: 18 incorrect calls
2019 Season: 24 incorrect calls
2020 Season: 16 incorrect calls
2021 Season: 21 incorrect calls
2022 Season: 30 incorrect calls
2023 Season: 31 incorrect calls
2024 Season: 9 incorrect calls
Team: Nuggets
2015 Season: 10 incorrect calls
2016 Season: 24 incorrect calls
2017 Season: 14 incorrect calls
2018 Season: 22 incorrect calls
2019 Season: 29 incorrect calls
2020 Season: 26 incorrect calls
2021 Season: 15 incorrect calls
2022 Season: 19 incorrect calls
2023 Season: 20 incorrect calls
2024 Season: 18 incorrect calls
Team: Nets
2015 Season: 14 incorrect calls
2016 Season: 19 incorrect calls
2017 Season: 17 incorrect calls
2018 Season: 18 incorrect calls
2019 Season: 23 incorrect calls
2020 Season: 23 incorrect calls
2021 Season: 17 incorrect calls
2022 Season: 27 incorrect calls
2023 Season: 16 incorrect calls
2024 Season: 11 incorrect calls
We might also want to know how many favorable calls each team gets every season. This is similar to what we did in the previous code block, with some minor tweaks. In this case, a team is considered favored when one of its players draws an incorrect foul call against the other team (and gains possession of the ball or gets to shoot free throws as a result). When it comes to implementing this, we will increment a team's favorable call count when the decision is incorrect and the player who got fouled, AKA the opponent, is on the current team.
Count the Number of Incorrect Calls That Favor Each Team for Each Season
# Dictionary to store the results
team_season_favorable_counts = {}
for team in teams:
    team_season_favorable_counts[team] = {}
    for season in foul_df_clean["season"].unique():
        # Simple change: player team -> opponent team
        # We want to find observations where players on the current team were
        # given incorrect foul calls (which put the other team at a disadvantage)
        team_season_data = foul_df_clean[(foul_df_clean["opponent team"] == team) & (foul_df_clean["season"] == season)]
        # Count observations where correct decision is 0 (incorrect)
        count_favorable = len(team_season_data[(team_season_data["correct decision"] == 0)])
        # Store the count for the current season and move onto the next season
        team_season_favorable_counts[team][season] = count_favorable
# Preview the counts for the first four teams
counter = 0
for team, season_counts in team_season_favorable_counts.items():
    counter += 1
    if counter == 5:
        break
    print(f"Team: {team}")
    for season, count in season_counts.items():
        print(f"{season} Season: {count} favorable incorrect calls")
Team: Rockets
2015 Season: 6 favorable incorrect calls
2016 Season: 22 favorable incorrect calls
2017 Season: 16 favorable incorrect calls
2018 Season: 14 favorable incorrect calls
2019 Season: 25 favorable incorrect calls
2020 Season: 21 favorable incorrect calls
2021 Season: 6 favorable incorrect calls
2022 Season: 15 favorable incorrect calls
2023 Season: 14 favorable incorrect calls
2024 Season: 28 favorable incorrect calls
Team: Celtics
2015 Season: 5 favorable incorrect calls
2016 Season: 22 favorable incorrect calls
2017 Season: 12 favorable incorrect calls
2018 Season: 17 favorable incorrect calls
2019 Season: 20 favorable incorrect calls
2020 Season: 21 favorable incorrect calls
2021 Season: 12 favorable incorrect calls
2022 Season: 23 favorable incorrect calls
2023 Season: 31 favorable incorrect calls
2024 Season: 15 favorable incorrect calls
Team: Nuggets
2015 Season: 8 favorable incorrect calls
2016 Season: 18 favorable incorrect calls
2017 Season: 10 favorable incorrect calls
2018 Season: 27 favorable incorrect calls
2019 Season: 30 favorable incorrect calls
2020 Season: 29 favorable incorrect calls
2021 Season: 30 favorable incorrect calls
2022 Season: 36 favorable incorrect calls
2023 Season: 39 favorable incorrect calls
2024 Season: 11 favorable incorrect calls
Team: Nets
2015 Season: 6 favorable incorrect calls
2016 Season: 12 favorable incorrect calls
2017 Season: 14 favorable incorrect calls
2018 Season: 30 favorable incorrect calls
2019 Season: 29 favorable incorrect calls
2020 Season: 24 favorable incorrect calls
2021 Season: 12 favorable incorrect calls
2022 Season: 20 favorable incorrect calls
2023 Season: 25 favorable incorrect calls
2024 Season: 11 favorable incorrect calls
We may also expect the home and away teams to be treated differently every game, so let's go ahead and repeat what we did in the previous two blocks but for the home and away columns.
Alright, this might seem like a lot, but we are almost at the finish line!
Repeat the Previous Two Steps for Home and Away Teams
# Dictionary to store the results
home_away_counts = {"Home":{},"Away":{}}
for season in foul_df_clean["season"].unique():
    home_season_inc = foul_df_clean[(foul_df_clean["player team"] == foul_df_clean["home"]) & (foul_df_clean["season"] == season)]
    away_season_inc = foul_df_clean[(foul_df_clean["player team"] == foul_df_clean["away"]) & (foul_df_clean["season"] == season)]
    home_season_fav = foul_df_clean[(foul_df_clean["opponent team"] == foul_df_clean["home"]) & (foul_df_clean["season"] == season)]
    away_season_fav = foul_df_clean[(foul_df_clean["opponent team"] == foul_df_clean["away"]) & (foul_df_clean["season"] == season)]
    # Count observations where correct decision is 0 (incorrect) for the home team
    count_incorrect_home = len(home_season_inc[(home_season_inc["correct decision"] == 0)])
    # Do the same for the away team
    count_incorrect_away = len(away_season_inc[(away_season_inc["correct decision"] == 0)])
    # Count observations where correct decision is 0 (incorrect) but favors the home team
    count_favorable_home = len(home_season_fav[(home_season_fav["correct decision"] == 0)])
    # Do the same for the away team
    count_favorable_away = len(away_season_fav[(away_season_fav["correct decision"] == 0)])
    # Store the counts for the current season and move onto the next season
    # The counts are represented as a tuple (# incorrect, # favorable incorrect)
    home_away_counts["Home"][season] = (count_incorrect_home, count_favorable_home)
    home_away_counts["Away"][season] = (count_incorrect_away, count_favorable_away)
for team, season_counts in home_away_counts.items():
    print(f"{team}")
    for season, (inc_count, fav_count) in season_counts.items():
        print(f"{season} Season: {inc_count} incorrect calls, {fav_count} favorable calls")
Home
2015 Season: 127 incorrect calls, 87 favorable calls
2016 Season: 388 incorrect calls, 264 favorable calls
2017 Season: 267 incorrect calls, 209 favorable calls
2018 Season: 313 incorrect calls, 228 favorable calls
2019 Season: 371 incorrect calls, 325 favorable calls
2020 Season: 259 incorrect calls, 293 favorable calls
2021 Season: 235 incorrect calls, 244 favorable calls
2022 Season: 349 incorrect calls, 322 favorable calls
2023 Season: 365 incorrect calls, 334 favorable calls
2024 Season: 227 incorrect calls, 254 favorable calls
Away
2015 Season: 125 incorrect calls, 95 favorable calls
2016 Season: 369 incorrect calls, 304 favorable calls
2017 Season: 320 incorrect calls, 185 favorable calls
2018 Season: 280 incorrect calls, 248 favorable calls
2019 Season: 351 incorrect calls, 343 favorable calls
2020 Season: 296 incorrect calls, 255 favorable calls
2021 Season: 248 incorrect calls, 228 favorable calls
2022 Season: 328 incorrect calls, 346 favorable calls
2023 Season: 338 incorrect calls, 365 favorable calls
2024 Season: 254 incorrect calls, 226 favorable calls
Plot the Data
Okay, now that we have counted up calls for every team across every season, we can use our data to plot incorrect and favorable call counts over time. We can also calculate the league-average number of incorrect and favorable calls for every season using the mean() function once again. Follow the code below and take a look at the results; you might notice something interesting!
Interestingly enough, there seem to be some significant team-level differences in call frequency. However, these trends shift quite a bit from season to season. It may be hard to see, but take a look at the Nuggets' number of favorable calls relative to the mean line. In 2023, the year they won the Finals, the Nuggets were actually far above average in terms of favorable calls. In contrast, the Warriors fell below the average number of favorable calls in 2022, when they won the championship over the Celtics. Thus, we cannot really generalize that some teams consistently receive better treatment from the refs than others, but we can keep exploring other factors! If you want to feel better about the team you support, see how they compare to the rest of the league by making some minor changes to the code below.
# Create a dataframe from the team_season_incorrect_counts dictionary and transpose
team_season_incorrect_df = pd.DataFrame(team_season_incorrect_counts)
team_season_incorrect_df = team_season_incorrect_df.T
# Create a dataframe from the team_season_favorable_counts dictionary and transpose
team_season_favorable_df = pd.DataFrame(team_season_favorable_counts)
team_season_favorable_df = team_season_favorable_df.T
# Calculate the average number of incorrect and favorable incorrect calls per season
avg_incorrect_calls = team_season_incorrect_df.mean(axis=0)
avg_favorable_calls = team_season_favorable_df.mean(axis=0)
# Plotting
fig, axs = plt.subplots(2, 1, figsize=(12, 14))
# Plot for number of incorrect calls against each team over seasons
for team in team_season_incorrect_df.index:
    axs[0].plot(team_season_incorrect_df.columns, team_season_incorrect_df.loc[team], label=team)
# Plot the average line for incorrect calls
axs[0].plot(team_season_incorrect_df.columns, avg_incorrect_calls, label="Average", color="black", linewidth=2, linestyle="--")
axs[0].set_title("Number of Incorrect Calls Against Each Team Over Seasons")
axs[0].set_xlabel("Season")
axs[0].set_ylabel("Number of Incorrect Calls")
axs[0].grid(True)
axs[0].legend(loc="center left", bbox_to_anchor=(1, 0.5))
# Plot for number of favorable incorrect calls for each team over seasons
for team in team_season_favorable_df.index:
    axs[1].plot(team_season_favorable_df.columns, team_season_favorable_df.loc[team], label=team)
# Plot the average line for favorable calls
axs[1].plot(team_season_favorable_df.columns, avg_favorable_calls, label="Average", color="black", linewidth=2, linestyle="--")
axs[1].set_title("Number of Favorable Incorrect Calls For Each Team Over Seasons")
axs[1].set_xlabel("Season")
axs[1].set_ylabel("Number of Favorable Incorrect Calls")
axs[1].grid(True)
axs[1].legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()
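For example, here is a quick sketch of how you might pull out a single team (the Knicks here, swap in your own) and line its favorable-call counts up against the league average from the plot above:
# Compare one team's favorable incorrect calls to the league average per season
my_team = "Knicks"  # change this to the team you support
comparison = pd.DataFrame({
    my_team: team_season_favorable_df.loc[my_team],
    "League average": avg_favorable_calls
})
print(comparison)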
Since we already counted calls for the home and away team columns, we can plot them as well. Overall, the away team drew more favorable calls in six of the ten seasons (2015-2016, 2018-2019, and 2022-2023).
# Get number of favorable and incorrect calls by season for home and away teams
seasons = sorted(list(home_away_counts["Home"].keys()))
home_incorrect_counts = [home_away_counts["Home"][season][0] for season in seasons]
away_incorrect_counts = [home_away_counts["Away"][season][0] for season in seasons]
home_favorable_counts = [home_away_counts["Home"][season][1] for season in seasons]
away_favorable_counts = [home_away_counts["Away"][season][1] for season in seasons]
# Plotting
fig, axs = plt.subplots(2, 1, figsize=(12, 10))
# Plot for number of incorrect calls over seasons for the home and away teams
axs[0].plot(seasons, home_incorrect_counts, label="Home")
axs[0].plot(seasons, away_incorrect_counts, label="Away")
axs[0].set_title("Number of Incorrect Calls Over Seasons for Home and Away Teams")
axs[0].set_xlabel("Season")
axs[0].set_ylabel("Number of Incorrect Calls")
axs[0].legend()
axs[0].grid(True)
# Plot for number of favorable calls over seasons for the home and away teams
axs[1].plot(seasons, home_favorable_counts, label="Home")
axs[1].plot(seasons, away_favorable_counts, label="Away")
axs[1].set_title("Number of Favorable Calls Over Seasons for Home and Away Teams")
axs[1].set_xlabel("Season")
axs[1].set_ylabel("Number of Favorable Calls")
axs[1].legend()
axs[1].grid(True)
plt.tight_layout()
plt.show()
Of course, not every referee is the same when it comes to accuracy; they aren't robots! To visualize this, we can count and plot the number of incorrect calls for every unique official listed in the dataset. This is not enough on its own, though, because some refs have not officiated as many games and are therefore less likely to rack up the same number of wrong calls as the more experienced referees. To account for this, we can also track how many standard deviations each referee's accuracy (proportion of correct calls to total calls) sits from the league average. This levels the playing field a bit and lets us compare the performance of each referee. You might notice that there is quite a bit of variability in accuracy across officials. Interesting!
Plot the Number of Incorrect Calls and Call Accuracy by Official
# Group by official and count the total number of calls made by each official
total_calls_by_official = foul_df_clean.groupby("official").size()
# Then get all of the incorrect calls
incorrect_calls_df = foul_df_clean[foul_df_clean["correct decision"] == 0]
# Group by official and count their incorrect calls
incorrect_calls_by_official = incorrect_calls_df.groupby("official").size()
# Calculate accuracy (proportion of correct calls)
accuracy_by_official = (total_calls_by_official - incorrect_calls_by_official) / total_calls_by_official
# Calculate mean call accuracy and standard deviation
mean_accuracy = accuracy_by_official.mean()
std_accuracy = accuracy_by_official.std()
# Calculate number of standard deviations from the mean (z-score) for each official
z_score = (accuracy_by_official - mean_accuracy) / std_accuracy
# Sort officials based on total incorrect calls
incorrect_calls_by_official = incorrect_calls_by_official.sort_values()
# Sort officials based on number of standard deviations from the mean accuracy
z_score = z_score.sort_values()
# Create a figure and two subplots
fig, axes = plt.subplots(2, 1, figsize=(10, 12))
# Plot total incorrect calls by official
axes[0].bar(incorrect_calls_by_official.index, incorrect_calls_by_official, color="lightblue")
axes[0].set_title("Total Incorrect Calls by Official")
axes[0].set_xlabel("Official")
axes[0].set_ylabel("Number of Incorrect Calls")
axes[0].tick_params(axis="x", rotation=90)
axes[0].tick_params(axis="x", labelsize=6) # Adjust label size
# Plot number of standard deviations from mean accuracy by official
axes[1].bar(z_score.index, z_score, color="orange")
axes[1].axhline(0, color="gray", linestyle="--", linewidth=0.5) # Add horizontal line at y=0
axes[1].set_title("Number of Standard Deviations from Mean Accuracy by Official")
axes[1].set_xlabel("Official")
axes[1].set_ylabel("Standard Deviations from Mean Accuracy")
axes[1].tick_params(axis="x", rotation=90)
axes[1].tick_params(axis="x", labelsize=6) # Adjust label size
plt.tight_layout()
plt.show()
The last two plots in this section are much easier to complete. To plot the frequency of incorrect calls as the number of seconds remaining decreases, we can just remove observations where the number of seconds is above 120, or 2 minutes. To plot the frequency of incorrect calls depending on the type of foul called, we can use the groupby() and mean() functions.
Based on the charts, it seems like referees tend to get more calls wrong as game time decreases. This can be especially concerning for close games! Also, the correctness of calls appears to shift significantly across foul types, perhaps due to some calls being more common than others.
Plot the Number of Incorrect Calls Over Remaining Game Time
# Keep only the fouls called in the last two minutes (120 seconds or less remaining)
l2m = foul_df_clean[foul_df_clean["seconds"] <= 120]
# Select the incorrect calls made in the last two minutes
incorrect_calls_l2m = l2m[l2m["correct decision"] == 0]
# Plotting
plt.figure(figsize=(8, 6))
incorrect_calls_l2m["seconds"].hist(bins=24, color="skyblue", edgecolor="black")
plt.xlabel("Seconds Remaining")
plt.ylabel("Number of Incorrect Calls")
plt.title("Number of Incorrect Calls in the Last Two Minutes")
plt.grid(axis="y")
plt.show()
Plot the Accuracy of Calls for Each Type of Foul Call
# Calculate accuracy for each type of call
accuracy_by_type = foul_df_clean.groupby("type")["correct decision"].mean().sort_values()
# Plot the accuracy for each type of call
plt.figure(figsize=(10, 6))
sns.barplot(x=accuracy_by_type.index,y=accuracy_by_type.values)
plt.title("Accuracy of Calls by Type")
plt.xlabel("Type of Call")
plt.ylabel("Accuracy")
plt.tick_params(axis="x", rotation=90, labelsize=6)
plt.show()
With all of the exploration complete, I think we are ready to start coming up with some possible hypotheses.
Model Analysis, Hypothesis Testing, Machine Learning
At this stage of the data science pipeline, we will use a few well-known modeling and machine learning techniques to see how valid our assumptions are. To be more specific, we will train logistic regression and random forest models on our cleaned data and see how useful certain variables like official and remaining game time are for predicting incorrect foul calls. We will explain the different models later on, but first we have to prepare our dataset for training.
We will use the same independent (predictors) and dependent (response) variables for both models. Independent: season, seconds remaining, foul type, official. Dependent: correct decision.
If you try to split the dataset and train a model on it as-is, you will likely run into errors. This is because official names and foul types are represented as strings, which you cannot perform mathematical operations on! Fixing this is pretty simple with one-hot encoding, a method that converts categorical variables into unique sequences of 1s and 0s. For example, we can map the values [Cat, Dog, Fish] to [[1,0,0], [0,1,0], [0,0,1]]. If you want to learn more about one-hot encoding, you can read about it here. In Python, all we have to do is run the Pandas get_dummies() function to one-hot encode the season, type, and official columns. Even though seasons are technically numeric, we will treat them as categorical values.
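To see what get_dummies() does before applying it to the real columns, here is a quick sketch using the toy [Cat, Dog, Fish] values from above:
# One-hot encode a toy column of categorical values
toy = pd.DataFrame({"animal": ["Cat", "Dog", "Fish"]})
encoded = pd.get_dummies(toy, columns=["animal"])
print(encoded)
# Each animal becomes its own indicator column (animal_Cat, animal_Dog, animal_Fish),
# with a single "on" value per row (shown as True/False or 1/0 depending on your pandas version)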
While we are at it, we can standardize the seconds column by calculating its mean and standard deviation and applying the formula: $$Z = \frac{x - \textrm{mean}}{\textrm{standard deviation}}$$ Standardization rescales the column so it has a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, but it puts the seconds feature (which can range from 0 to 120) on a scale comparable to the 0/1 encoded columns, so its raw magnitude does not dominate the model.
Preprocess Data
# Create a new dataset which includes the features we want to test on
# Our independent variables (predictors) are season, seconds remaining, type of foul called, and the official
# Our dependent variable (response) is the correctness of the decision
foul_df_final = foul_df_clean[["season", "seconds", "type", "official", "correct decision"]].copy()
# Standardize the seconds column
mean_seconds = np.mean(foul_df_final["seconds"])
std_seconds = np.std(foul_df_final["seconds"])
foul_df_final["seconds"] = (foul_df_final["seconds"] - mean_seconds) / std_seconds
# One hot encode seasons and official names
foul_df_final = pd.get_dummies(foul_df_final, columns=["season", "type", "official"], prefix=["season", "type", "official"])
foul_df_final.head()
seconds | correct decision | season_2015 | season_2016 | season_2017 | season_2018 | season_2019 | season_2020 | season_2021 | season_2022 | ... | official_Suyash Mehta | official_Tom Washington | official_Tony Brothers | official_Tre Maddox | official_Tyler Ford | official_Tyler Mirkovich | official_Tyler Ricks | official_Violet Palmer | official_Vladimir Voyard-Tadal | official_Zach Zarba | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.452359 | 1 | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | 1.230863 | 1 | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | 0.960147 | 1 | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | 0.394103 | 1 | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | 0.000334 | 1 | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 179 columns
Checking for and Handling Target Class Imbalance
# Count the occurrences of each value in the correct decision column
correct_decision_counts = foul_df_final["correct decision"].value_counts()
print(correct_decision_counts)
correct decision
1    74543
0     5837
Name: count, dtype: int64
Oh no, it looks like our dataset is imbalanced! Class imbalance occurs when the minority class, which is "correct decision == 0" in this case, appears far less often than the majority class. There are over 12 times as many correct decisions as there are incorrect decisions. There are a few ways to deal with this, but the easiest solution is to pass a certain parameter when running our machine learning models (more on this soon). Before that, we can split our data into training and test sets with the train_test_split function.
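As a quick aside, the parameter we have in mind is scikit-learn's class_weight="balanced" option, which we will pass to both models below. Roughly speaking, it weights each class by n_samples / (n_classes * class_count), so the rare incorrect calls count much more heavily during training. A short sketch using the counts we just printed:
# Sketch of the weights that class_weight="balanced" assigns to each class
from sklearn.utils.class_weight import compute_class_weight

counts = foul_df_final["correct decision"].value_counts()
manual_weights = counts.sum() / (len(counts) * counts)
print(manual_weights)  # class 0 (incorrect) gets a weight roughly 13x larger than class 1

# The same numbers via scikit-learn's helper function
print(compute_class_weight(class_weight="balanced",
                           classes=np.array([0, 1]),
                           y=foul_df_final["correct decision"]))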
Split Data into Train and Test Splits
# Prepare the data for a train test split
# Our predictors are season, seconds remaining, foul type, and official
# Our response is a correct decision (1) or incorrect decision (0)
X = foul_df_final.drop(columns=["correct decision"]) # All columns except the target column
y = foul_df_final["correct decision"] # Target column
# Test size = the proportion of the dataset to include in the test split
# Random state = controls the shuffling of the data before the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Logistic regression is used for predicting binary outcomes like incorrect and correct foul calls. It passes a weighted combination of our independent variables through the sigmoid function to produce a probability between 0 and 1, and that probability is turned into a prediction using a threshold, typically 0.5. All we have to do is provide the LogisticRegression model with a maximum number of iterations and the parameter class_weight="balanced", which tells scikit-learn to weight each class inversely to its frequency so the rare incorrect calls are not ignored. In the background, the model will use our training data to find the coefficients for our independent variables that best fit the training outputs. Once the model is trained, we can assess its performance by making predictions on the test set with the predict() function. We can also see which variables are most influential by printing the five largest coefficients (by absolute value). From this we can gather that the SUPPORT RULING, OVERTURN RULING, and PERSONAL TAKE call types have the strongest influence on the model's predicted probabilities.
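Before fitting the real model, here is a tiny self-contained sketch of the sigmoid step described above, using made-up weights and feature values purely for illustration:
# The sigmoid squashes any real number into a probability between 0 and 1
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weighted sum of two features plus an intercept (not from our fitted model)
z = 0.8 * 1.0 + (-1.2) * 0.5 + 0.3
probability = sigmoid(z)              # ~0.62
prediction = int(probability >= 0.5)  # 1, since the probability clears the 0.5 threshold
print(probability, prediction)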
Logistic Regression: Predicting/Classifying Incorrect Calls
# Initialize and train the model
model = LogisticRegression(max_iter=2000, class_weight="balanced")
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Get the coefficients for the predictors
coefficients = model.coef_[0]
top_indices = np.argsort(np.abs(coefficients))[::-1][:5]
for index in top_indices:
    feature = X_train.columns[index]
    coefficient = coefficients[index]
    print(f"Feature: {feature}, Coefficient: {coefficient}")
Feature: type_SUPPORT RULING, Coefficient: 4.351317369954331
Feature: type_OVERTURN RULING, Coefficient: 3.5367877183853538
Feature: type_PERSONAL TAKE, Coefficient: 3.358119584102903
Feature: type_OUT OF BOUNDS - BAD PASS TURN, Coefficient: 2.58783960803185
Feature: official_Sha'Rae Mitchell, Coefficient: 2.2084677967373034
To evaluate the model's performance, we can use the following metrics:
- Accuracy = the overall percentage of correct predictions.
- Precision = the percentage of positive predictions that are actually correct.
- Recall = the percentage of actual positives that are correctly identified.
- F1 = the harmonic mean of precision and recall.
For visualization, we can plot a confusion matrix, which is a table showing the number of true/false positives and true/false negatives. A true positive is a correct prediction of the positive class, a false positive is an incorrect prediction of the positive class, a true negative is a correct prediction of the negative class, and a false negative is an incorrect prediction of the negative class. Our logistic regression produces a massive number of true positives compared to everything else, mostly because there are far more positive examples than negative examples in our dataset. To further reduce the effect of class imbalance, we could undersample the majority class (remove some of the positive examples) or oversample the minority class, as sketched below.
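Here is a rough sketch of the random-oversampling idea (we do not actually apply it in this tutorial): duplicate minority-class rows in the training set until the two classes are roughly the same size, then retrain on the balanced set.
# Random oversampling sketch: duplicate incorrect-call rows in the training data
train = X_train.copy()
train["correct decision"] = y_train
majority = train[train["correct decision"] == 1]
minority = train[train["correct decision"] == 0]
# Sample the minority class with replacement until it matches the majority class size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced_train = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
X_train_bal = balanced_train.drop(columns=["correct decision"])
y_train_bal = balanced_train["correct decision"]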
Finally, we can plot an ROC curve, which shows the tradeoff between the model's True Positive Rate and False Positive Rate (TPR and FPR). Without going into too much detail, a curve that hugs the top-left corner is good, while a curve that sits on the diagonal line means our model's predictions are no better than random guesses. The area under the curve, or AUC, summarizes the performance of our model in a single value between 0 and 1: an AUC of 1 indicates a perfect model, while an AUC of 0.5 is equivalent to guessing. Thankfully, we achieved an AUC of 0.69. Not perfect by any means, but not meaningless either! For more information on these metrics, go here.
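For reference, the two rates on the ROC curve's axes can be computed by hand from the confusion-matrix counts at a single threshold (just a sanity-check sketch; the roc_curve function used later sweeps over many thresholds):
# True positive rate and false positive rate at the default 0.5 threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / (tp + fn)  # same as recall: share of actual positives we caught
fpr = fp / (fp + tn)  # share of actual negatives we mistakenly flagged as positive
print(tpr, fpr)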
Testing our Model
# Evaluate the model using a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot confusion matrix as a heatmap
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="g", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Calculate precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)
# Calculate recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)
# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)
Accuracy: 0.7515551132122419
Precision: 0.9500537501033656
Recall: 0.7721621076685261
F1 Score: 0.8519205101586831
Plotting the ROC Curve
# Get predicted probabilities for the positive class
y_probs = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs, pos_label=1)
# Calculate AUC (Area Under the Curve)
auc = roc_auc_score(y_test, y_probs)
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="blue", label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], color="gray", linestyle="--", lw=1)
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
Let's move on and compare our logistic regression to a random forest model. The random forest classifier is an ensemble method that combines the predictions of multiple decision trees to produce a more accurate prediction than any single decision tree on its own; usually, the final prediction is made by taking the majority vote of the trees. Each decision tree splits the input data into branches based on feature/predictor values and produces predictions at the leaf level. The features considered at any given split are chosen through random feature selection, a process that randomly picks a subset of predictors (like official and seconds, or season and call type). You can learn about the model in more detail here.
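To make the voting idea concrete, here is a tiny, purely illustrative sketch with hypothetical votes (not taken from our trained forest, which scikit-learn aggregates internally):
# Majority vote over hypothetical predictions from five decision trees
from collections import Counter

tree_predictions = [1, 0, 1, 1, 0]  # made-up votes: three trees say "correct call", two say "incorrect"
majority_class = Counter(tree_predictions).most_common(1)[0][0]
print(majority_class)  # 1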
After calculating the same metrics, we can see a few things:
- The F1 score is higher than the logistic regression's. This means the random forest model is better at predicting correct foul calls.
- The AUC is slightly lower than the logistic regression's. This means the random forest model is slightly worse at distinguishing between incorrect and correct foul calls.
- The most important feature is not a specific call type, it is the number of seconds left in a game.
Comparing the Results to a Random Forest Classifier
# Initialize and train the Random Forest model
model_rf = RandomForestClassifier(random_state=42, class_weight="balanced")
model_rf.fit(X_train, y_train)
# Predictions
y_pred_rf = model_rf.predict(X_test)
# Evaluate the model using a confusion matrix
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
# Plot confusion matrix as a heatmap
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix_rf, annot=True, cmap="Blues", fmt="g", cbar=False)
plt.title("Confusion Matrix (Random Forest)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
# Calculate accuracy, precision, recall, and F1-score
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
print("Accuracy (Random Forest):", accuracy_rf)
print("Precision (Random Forest):", precision_rf)
print("Recall (Random Forest):", recall_rf)
print("F1 Score (Random Forest):", f1_rf)
# Get predicted probabilities for the positive class
y_probs_rf = model_rf.predict_proba(X_test)[:, 1]
# Calculate ROC curve and AUC
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_probs_rf, pos_label=1)
auc_rf = roc_auc_score(y_test, y_probs_rf)
# Plot ROC curve
plt.figure(figsize=(6, 4))
plt.plot(fpr_rf, tpr_rf, color="blue", label=f"ROC Curve (AUC = {auc_rf:.2f})")
plt.plot([0, 1], [0, 1], color="gray", linestyle="--", lw=1)
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
# Get the feature importances
feature_importances_rf = model_rf.feature_importances_
# Sort the features based on their importance
sorted_indices_rf = np.argsort(feature_importances_rf)[::-1]
top_features_rf = X_train.columns[sorted_indices_rf][:5]
top_importances_rf = feature_importances_rf[sorted_indices_rf][:5]
# Plot the top 5 features
plt.figure(figsize=(6, 4))
plt.barh(top_features_rf, top_importances_rf, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 5 Feature Importances")
plt.gca().invert_yaxis()
plt.show()
Accuracy (Random Forest): 0.8903956208011943
Precision (Random Forest): 0.9306586118589533
Recall (Random Forest): 0.9525505746353922
F1 Score (Random Forest): 0.9414773482130995
Insights

This part of the data lifecycle is where we try to draw conclusions from our analysis and modeling. So, what have we learned here? Well, we can say a few things:
- Since 2015, referees have gotten better at making foul calls at the end of games.
- Some teams get more favorable calls than others, but this is not consistent across seasons.
- Accuracy among officials can vary drastically.
- Certain foul types get incorrectly called a lot more than others, like double technical fouls. This makes sense as technical fouls are quite subjective and can be swayed by the emotions of a referee.
- Foul calls get increasingly inaccurate as time expires, and time appears to be the most important feature when it comes to predicting the correctness of a foul call.
Overall, it is safe to say that referees are not as bad as fans make them out to be. Although they get calls wrong, there are many factors outside of their control that make mistakes hard to avoid. With that in mind, referees may still have biases that cannot be found in our dataset. For example, officials could treat star players better than benchwarmers, but there is no way of knowing that without a player dataset on hand as well. The answers to our questions are still not fully settled, but we hope you can take what you learned from this project and apply it to other datasets!