Analysis of NBA Officiating

David Han
May 17, 2024
The Typical Instagram Comment Section After a Close Game
Source: Instagram

Introduction

Regardless of who they root for, most fans of the NBA can agree that officiating is far from perfect. A seemingly winnable game can suddenly become out of reach after one or two bad calls. Early playoff exits and heartbreaking losses are often blamed on biased referees and what sour fanbases like to call "the script"; just look at any social media platform after a close game.

With all of that being said, do NBA referees really deserve the criticism they receive? Are things actually rigged for certain teams? The main focus of this project is to answer some of these questions with data from previous seasons. This tutorial will walk you through the entire data science pipeline, from data collection to machine learning, and by the end of it you should have a better understanding of the factors impacting the accuracy of NBA referees. If we can use a model to predict incorrect calls based on the primary official of a game, we may even be able to hold specific refs accountable or find out who is fit for the job.

The tutorial will be separated into five main sections:

  1. Data Collection
  2. Data Processing
  3. Exploratory Analysis and Data Visualization
  4. Model Analysis, Hypothesis Testing, Machine Learning
  5. Insights

Data Collection

Before we can start exploring our topic, we need to find a dataset to work with. Since we want information related to foul calls, it makes the most sense to check the official NBA website for public data first. Luckily, the NBA has been publishing Last Two Minute Reports for every game since March 2015. An archive of these reports can be found here. As you may have guessed, Last Two Minute Reports assess the correctness of foul calls made within the last two minutes of a game and provide additional details such as the time left on the clock, the player who committed the foul, the player who was fouled, the type of foul called (technical, personal, defensive, blocking, etc.), and the review decision. There are four kinds of review decisions: CC (correct call), IC (incorrect call), CNC (correct non-call), and INC (incorrect non-call), where a non-call means the officials did not blow the whistle on the play. For our purposes, an incorrect call is anything classified as an IC or INC. Some fans complain that the reports are useless because they do nothing to change game outcomes, but for data scientists, they are great! Unfortunately, the NBA does not provide the reports in .csv or spreadsheet format, so we will have to deal with that somehow.

An NBA Last Two Minute Report
Source: NBA Officiating Last Two Minute Reports

Now, we could build a web scraper and extract all of this data ourselves, or we could look around for existing datasets. With a bit of digging, we were able to find this handy website, which has already scraped the foul calls from every Last Two Minute Report from March 2015 to May 2024. Go ahead and download the .csv file available on the GitHub page. At this point, we can import the necessary Python libraries and start coding!

Imports
In [1]:
# Import the necessary libraries
# NOTE: you may need to install the libraries first via "!pip install"
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import statsmodels.formula.api as sm
import os
import datetime
from unicodedata import normalize
from io import StringIO
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, roc_curve, roc_auc_score

If you are wondering what these libraries are for, here is a brief overview of the most important ones:

  • Pandas is a popular open-source tool that can be used to organize and work with data in Python. The main data structure used in Pandas is called a dataframe, which you can think of as a simple group of columns and rows (a tiny example follows this list). Because they are easy to manipulate and understand, dataframes will be our structure of choice for the rest of the tutorial.
  • NumPy is an extensive math library that we can use to perform fast operations on arrays. SciPy is similar to NumPy with a greater emphasis on scientific computing.
  • Matplotlib is a plotting library that we will use to create static visualizations like line graphs and histograms. Seaborn is built on top of Matplotlib and provides a higher-level interface for statistical plots.
  • The statsmodels.formula.api and scikit-learn libraries are useful for building and evaluating statistical and machine learning models.
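To make the dataframe idea concrete, here is a tiny, self-contained example; the toy values below are made up purely for illustration:

import pandas as pd  # already imported above, repeated here so the snippet stands alone

# A dataframe is just a set of named columns of equal length
toy_df = pd.DataFrame({
    "team": ["Knicks", "Lakers", "Celtics"],
    "decision": ["CC", "INC", "CNC"],
})
print(toy_df)
print(toy_df["decision"].value_counts())  # count how often each decision appears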

Also, if you are not familiar with Jupyter Notebook, you can read about the platform here.

Get Foul Data from 2015-2024
In [2]:
# Load the csv into a pandas dataframe
foul_df = pd.read_csv("foul_data_2015_2024.csv", low_memory=False)
foul_df.head()
Out[2]:
period time call_type committing disadvantaged decision comments game_details page file ... committing_min committing_team committing_side disadvantaged_min disadvantaged_team disadvantaged_side type2 time_min time_sec time2
0 Q4 01:52.0 Foul: Shooting Josh Smith Kevin Love CNC Smith (HOU) does not make contact with Love (C... Cavaliers @ Rockets (Mar 01, 2015) 1.0 L2M-CLE-HOU-3-1-15.pdf ... 30.950000 HOU home 35.016667 CLE away SHOOTING 1 52.0 1.866667
1 Q4 01:43.0 Foul: Shooting JR Smith James Harden CC Smith (CLE) makes contact with the body of Har... Cavaliers @ Rockets (Mar 01, 2015) 1.0 L2M-CLE-HOU-3-1-15.pdf ... 50.800000 CLE away 42.350000 HOU home SHOOTING 1 43.0 1.716667
2 Q4 01:32.0 Foul: Shooting Trevor Ariza LeBron James CC Ariza (HOU) makes contact with the shoulder of... Cavaliers @ Rockets (Mar 01, 2015) 1.0 L2M-CLE-HOU-3-1-15.pdf ... 42.233333 HOU home 42.100000 CLE away SHOOTING 1 32.0 1.533333
3 Q4 01:09.0 Foul: Loose Ball Terrence Jones Tristan Thompson CC Jones (HOU) makes contact with the arm of Thom... Cavaliers @ Rockets (Mar 01, 2015) 1.0 L2M-CLE-HOU-3-1-15.pdf ... 44.016667 HOU home 37.316667 CLE away LOOSE BALL 1 9.0 1.150000
4 Q4 00:53.0 Foul: Shooting Tristan Thompson Josh Smith CNC Smith (HOU) loses the ball as he goes up for t... Cavaliers @ Rockets (Mar 01, 2015) 1.0 L2M-CLE-HOU-3-1-15.pdf ... 37.316667 CLE away 30.950000 HOU home SHOOTING 0 53.0 0.883333

5 rows × 64 columns

After using the Pandas read_csv() function to create a dataframe from the foul dataset (in my case the file is named foul_data_2015_2024.csv, but you can name it whatever you like), we can take a quick look at the data with the head() function. We will get into what each column represents in the next section. Every row represents a single foul call.
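As a side note, read_csv() can also load a file directly from a URL, so if the dataset is hosted somewhere you can skip the manual download. A minimal sketch; the URL below is a placeholder, not the actual dataset link:

# Hypothetical example: replace the placeholder URL with the real download link
csv_url = "https://example.com/foul_data_2015_2024.csv"
foul_df = pd.read_csv(csv_url, low_memory=False)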

Data Processing

Cleaning the Foul Data
In [3]:
# Create a new dataframe with selected columns
cols = [
    "date", "time", "season", "home_team", "away_team",
    "type", "committing", "disadvantaged", "decision", 
    "committing_team", "disadvantaged_team", "OFFICIAL_1"
]
foul_df_clean = foul_df[cols].copy()

# Rename columns 
foul_df_clean = foul_df_clean.rename(columns={
    "home_team": "home",
    "away_team": "away",
    "committing": "player",
    "disadvantaged": "opponent",
    "committing_team": "player team", 
    "disadvantaged_team": "opponent team",
    "OFFICIAL_1": "official"
})

# Map team abbreviations to names
all_teams = {
    "ATL": "Hawks", "BOS": "Celtics", "BKN": "Nets", "CHA": "Hornets", "CHI": "Bulls", 
    "CLE": "Cavaliers", "DAL": "Mavericks", "DEN": "Nuggets", "DET": "Pistons", "GSW": "Warriors", 
    "HOU": "Rockets", "IND": "Pacers", "LAC": "Clippers", "LAL": "Lakers", "MEM": "Grizzlies", 
    "MIA": "Heat", "MIL": "Bucks", "MIN": "Timberwolves", "NOP": "Pelicans", "NYK": "Knicks", 
    "OKC": "Thunder", "ORL": "Magic", "PHI": "76ers", "PHX": "Suns", "POR": "Trail Blazers", 
    "SAC": "Kings", "SAS": "Spurs", "TOR": "Raptors", "TOT": "Total", "UTA": "Jazz", "WAS": "Wizards"
}

# Map team abbreviations to their names for the player and opponent team columns
foul_df_clean["player team"] = foul_df_clean["player team"].map(all_teams)
foul_df_clean["opponent team"] = foul_df_clean["opponent team"].map(all_teams)

foul_df_clean.head()
Out[3]:
date time season home away type player opponent decision player team opponent team official
0 2015-03-01 01:52.0 2015 Rockets Cavaliers SHOOTING Josh Smith Kevin Love CNC Rockets Cavaliers Dan Crawford
1 2015-03-01 01:43.0 2015 Rockets Cavaliers SHOOTING JR Smith James Harden CC Cavaliers Rockets Dan Crawford
2 2015-03-01 01:32.0 2015 Rockets Cavaliers SHOOTING Trevor Ariza LeBron James CC Rockets Cavaliers Dan Crawford
3 2015-03-01 01:09.0 2015 Rockets Cavaliers LOOSE BALL Terrence Jones Tristan Thompson CC Rockets Cavaliers Dan Crawford
4 2015-03-01 00:53.0 2015 Rockets Cavaliers SHOOTING Tristan Thompson Josh Smith CNC Cavaliers Rockets Dan Crawford

While it is nice that the dataset has many different features to choose from, we only need a select few for analysis and visualization. Of the 60+ available columns, we will use the following:

  • Date = the date the game was played.
  • Time = the amount of time left on the clock when the foul was called, in MINUTES:SECONDS.MILLISECONDS format.
  • Season = the year.
  • Home team = the name of the home team.
  • Away team = the name of the away team.
  • Type = the type of foul called.
  • Committing = the name of the player who committed the foul.
  • Disadvantaged = the name of the player who was fouled.
  • Decision = the review decision for the foul, CC, CNC, IC, or INC.
  • Committing team = the team of the player who committed the foul.
  • Disadvantaged team = the team of the player who got fouled.
  • Official 1 = the primary referee of the game in which the foul was called.

In addition to selecting these columns, we will rename some of them so they are a bit easier to work with and understand. Lastly, we will use the map() function to change the player and opponent team columns so they match the home and away team columns. If you take a look at the player/opponent team and home/away team columns before the modification, you will notice that their values do not follow the same format: the player/opponent teams use abbreviations instead of full names ("NYK" instead of "Knicks", for example).

If a row is missing the review decision (NaN), we will drop it altogether in the next step. After cleaning up the dataframe, displaying foul_df_clean should look like the block above.
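Before dropping anything, it is worth checking how much data is actually missing. A quick check, assuming the foul_df_clean dataframe built above:

# Count missing values in each of our selected columns
print(foul_df_clean.isna().sum())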

Remove NaN Observations and Add a New Column to Assess Correct and Incorrect Calls
In [4]:
# Remove rows where the decision column is NaN
foul_df_clean = foul_df_clean[~foul_df_clean["decision"].isna()].reset_index(drop=True)

def correct_decision(decision):
    if decision in ["CC", "CNC"]:
        return 1
    else:
        return 0

# Create a column that serves as a binary indicator for whether the referees acted correctly or incorrectly
foul_df_clean["correct decision"] = foul_df_clean["decision"].apply(correct_decision)
foul_df_clean.head()
Out[4]:
date time season home away type player opponent decision player team opponent team official correct decision
0 2015-03-01 01:52.0 2015 Rockets Cavaliers SHOOTING Josh Smith Kevin Love CNC Rockets Cavaliers Dan Crawford 1
1 2015-03-01 01:43.0 2015 Rockets Cavaliers SHOOTING JR Smith James Harden CC Cavaliers Rockets Dan Crawford 1
2 2015-03-01 01:32.0 2015 Rockets Cavaliers SHOOTING Trevor Ariza LeBron James CC Rockets Cavaliers Dan Crawford 1
3 2015-03-01 01:09.0 2015 Rockets Cavaliers LOOSE BALL Terrence Jones Tristan Thompson CC Rockets Cavaliers Dan Crawford 1
4 2015-03-01 00:53.0 2015 Rockets Cavaliers SHOOTING Tristan Thompson Josh Smith CNC Cavaliers Rockets Dan Crawford 1

We previously mentioned that incorrect calls include both IC and INC decisions. To make things easier later on, we will create another column, "correct decision", to numerically indicate the correctness of each foul call. Since a foul call can only be wrong or right, we will use ones for correct calls (when decision == CNC or CC) and zeros for incorrect calls (when decision == INC or IC).
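As an aside, the same flag can be created without a helper function by using the Pandas isin() method; a one-line equivalent would look like this:

# Vectorized equivalent of the correct_decision mapping
foul_df_clean["correct decision"] = foul_df_clean["decision"].isin(["CC", "CNC"]).astype(int)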

In the next code block we will also use the time column to create a new column containing the seconds remaining in a game.

Create a New Column for Seconds Remaining
In [5]:
# Splitting time into minutes and seconds
foul_df_clean[["min", "sec"]] = foul_df_clean["time"].str.split(":", expand=True)
foul_df_clean["min"] = foul_df_clean["min"].astype(int)
foul_df_clean["sec"] = foul_df_clean["sec"].astype(float)

# Calculate total time in seconds
foul_df_clean["seconds"] = foul_df_clean["min"] * 60 + foul_df_clean["sec"]
foul_df_clean.head()
Out[5]:
date time season home away type player opponent decision player team opponent team official correct decision min sec seconds
0 2015-03-01 01:52.0 2015 Rockets Cavaliers SHOOTING Josh Smith Kevin Love CNC Rockets Cavaliers Dan Crawford 1 1 52.0 112.0
1 2015-03-01 01:43.0 2015 Rockets Cavaliers SHOOTING JR Smith James Harden CC Cavaliers Rockets Dan Crawford 1 1 43.0 103.0
2 2015-03-01 01:32.0 2015 Rockets Cavaliers SHOOTING Trevor Ariza LeBron James CC Rockets Cavaliers Dan Crawford 1 1 32.0 92.0
3 2015-03-01 01:09.0 2015 Rockets Cavaliers LOOSE BALL Terrence Jones Tristan Thompson CC Rockets Cavaliers Dan Crawford 1 1 9.0 69.0
4 2015-03-01 00:53.0 2015 Rockets Cavaliers SHOOTING Tristan Thompson Josh Smith CNC Cavaliers Rockets Dan Crawford 1 0 53.0 53.0

Exploratory Analysis and Data Visualization

In this section, we are going to create a few different plots with our cleaned dataset and try to find some possible relationships between the variables. Feel free to recreate this part and try different graphs; the point of this stage is to experiment a bit!

For the first graph, let's see how referees have performed over time by plotting the average foul call accuracy for each season. After grouping rows together by season using groupby(), the mean() function will allow us to quickly calculate the average value of the "correct decision" column for each group. This is basically the same as finding: $$\frac{\textrm{\# of correct decisions}}{\textrm{total decisions}}$$

After plotting with Matplotlib, you will notice that referees have *generally* gotten better over time. We will keep this in mind moving forward.

Plot Accuracy of Foul Calls Over Time
In [6]:
# Group by season and calculate the mean accuracy of decisions for each season
season_accuracy = foul_df_clean.groupby("season")["correct decision"].mean()

# Plotting
plt.figure(figsize=(10, 6))
season_accuracy.plot(marker="o", linestyle="-")
plt.title("Accuracy of Foul Calls Over Seasons")
plt.xlabel("Season")
plt.ylabel("Mean Call Accuracy")
plt.xticks(season_accuracy.index)  
plt.grid(True)
plt.tight_layout()
plt.show()
Figure: Accuracy of Foul Calls Over Seasons (line plot)

To get a better idea of how biased (or unbiased) referees are, let's count the number of incorrect calls against each team. We will count a call as going against a team when the decision is incorrect and the player charged with the foul is on that team. In the following code block, we will use nested Python dictionaries to map team names to the number of incorrect calls against them for every season.

Count the Number of Incorrect Calls Against Each Team for Each Season
In [33]:
# Get the unique team names
teams = foul_df_clean["home"].unique()

# Dictionary to store the results
team_season_incorrect_counts = {}

for team in teams:
    # Initialize a dictionary for the current team
    team_season_incorrect_counts[team] = {}
    
    for season in foul_df_clean["season"].unique():
        # Filter data for the current team and season
        team_season_data = foul_df_clean[(foul_df_clean["player team"] == team) & (foul_df_clean["season"] == season)]
        
        # Count observations where correct decision is 0 (incorrect)
        count_incorrect = len(team_season_data[(team_season_data["correct decision"] == 0)])
        
        # Store the count for the current season and move onto the next season
        team_season_incorrect_counts[team][season] = count_incorrect

counter = 0
for team, season_counts in team_season_incorrect_counts.items():
    counter += 1
    if counter == 5:
        break
    print(f"Team: {team}")
    for season, count in season_counts.items():
        print(f"{season} Season: {count} incorrect calls")
Team: Rockets
2015 Season: 4 incorrect calls
2016 Season: 45 incorrect calls
2017 Season: 26 incorrect calls
2018 Season: 12 incorrect calls
2019 Season: 43 incorrect calls
2020 Season: 10 incorrect calls
2021 Season: 7 incorrect calls
2022 Season: 10 incorrect calls
2023 Season: 14 incorrect calls
2024 Season: 25 incorrect calls
Team: Celtics
2015 Season: 7 incorrect calls
2016 Season: 13 incorrect calls
2017 Season: 28 incorrect calls
2018 Season: 18 incorrect calls
2019 Season: 24 incorrect calls
2020 Season: 16 incorrect calls
2021 Season: 21 incorrect calls
2022 Season: 30 incorrect calls
2023 Season: 31 incorrect calls
2024 Season: 9 incorrect calls
Team: Nuggets
2015 Season: 10 incorrect calls
2016 Season: 24 incorrect calls
2017 Season: 14 incorrect calls
2018 Season: 22 incorrect calls
2019 Season: 29 incorrect calls
2020 Season: 26 incorrect calls
2021 Season: 15 incorrect calls
2022 Season: 19 incorrect calls
2023 Season: 20 incorrect calls
2024 Season: 18 incorrect calls
Team: Nets
2015 Season: 14 incorrect calls
2016 Season: 19 incorrect calls
2017 Season: 17 incorrect calls
2018 Season: 18 incorrect calls
2019 Season: 23 incorrect calls
2020 Season: 23 incorrect calls
2021 Season: 17 incorrect calls
2022 Season: 27 incorrect calls
2023 Season: 16 incorrect calls
2024 Season: 11 incorrect calls
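If you prefer to stay in Pandas rather than loop over dictionaries, the same counts can be computed with a single grouped aggregation; a minimal sketch using foul_df_clean from above:

# Incorrect calls per (team, season) in one groupby
incorrect_by_team_season = (
    foul_df_clean[foul_df_clean["correct decision"] == 0]
    .groupby(["player team", "season"])
    .size()
)
print(incorrect_by_team_season.head(10))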

We might also want to know how many favorable calls each team gets every season. This is similar to what we did in the previous code block, with some minor tweaks. In this case, we consider a team favored when an incorrect decision involved one of its players as the fouled (disadvantaged) player, since that is the side that gains possession of the ball or gets to shoot free throws. When it comes to implementing this, we will increment a team's favorable call count when the decision is incorrect and the player who got fouled, AKA the opponent, is on the current team.

Count the Number of Incorrect Calls That Favor Each Team for Each Season
In [32]:
# Dictionary to store the results
team_season_favorable_counts = {}

for team in teams:
    team_season_favorable_counts[team] = {}
    
    for season in foul_df_clean["season"].unique():
        # Simple change: player team -> opponent team
        # We want to find observations where players on the current team were 
        # given incorrect foul calls (which put the other team at a disadvantage)
        team_season_data = foul_df_clean[(foul_df_clean["opponent team"] == team) & (foul_df_clean["season"] == season)]
        
        # Count observations where correct decision is 0 (incorrect)
        count_favorable = len(team_season_data[(team_season_data["correct decision"] == 0)])
        
        # Store the count for the current season and move onto the next season
        team_season_favorable_counts[team][season] = count_favorable

counter = 0
for team, season_counts in team_season_favorable_counts.items():
    counter += 1
    if counter == 5:
        break
    print(f"Team: {team}")
    for season, count in season_counts.items():
        print(f"{season} Season: {count} favorable incorrect calls")
Team: Rockets
2015 Season: 6 favorable incorrect calls
2016 Season: 22 favorable incorrect calls
2017 Season: 16 favorable incorrect calls
2018 Season: 14 favorable incorrect calls
2019 Season: 25 favorable incorrect calls
2020 Season: 21 favorable incorrect calls
2021 Season: 6 favorable incorrect calls
2022 Season: 15 favorable incorrect calls
2023 Season: 14 favorable incorrect calls
2024 Season: 28 favorable incorrect calls
Team: Celtics
2015 Season: 5 favorable incorrect calls
2016 Season: 22 favorable incorrect calls
2017 Season: 12 favorable incorrect calls
2018 Season: 17 favorable incorrect calls
2019 Season: 20 favorable incorrect calls
2020 Season: 21 favorable incorrect calls
2021 Season: 12 favorable incorrect calls
2022 Season: 23 favorable incorrect calls
2023 Season: 31 favorable incorrect calls
2024 Season: 15 favorable incorrect calls
Team: Nuggets
2015 Season: 8 favorable incorrect calls
2016 Season: 18 favorable incorrect calls
2017 Season: 10 favorable incorrect calls
2018 Season: 27 favorable incorrect calls
2019 Season: 30 favorable incorrect calls
2020 Season: 29 favorable incorrect calls
2021 Season: 30 favorable incorrect calls
2022 Season: 36 favorable incorrect calls
2023 Season: 39 favorable incorrect calls
2024 Season: 11 favorable incorrect calls
Team: Nets
2015 Season: 6 favorable incorrect calls
2016 Season: 12 favorable incorrect calls
2017 Season: 14 favorable incorrect calls
2018 Season: 30 favorable incorrect calls
2019 Season: 29 favorable incorrect calls
2020 Season: 24 favorable incorrect calls
2021 Season: 12 favorable incorrect calls
2022 Season: 20 favorable incorrect calls
2023 Season: 25 favorable incorrect calls
2024 Season: 11 favorable incorrect calls

We may also expect the home and away teams to be treated differently every game, so let's go ahead and repeat what we did in the previous two blocks but for the home and away columns.

Alright, this might seem like a lot, but we are almost at the finish line!

Repeat the Previous Two Steps for Home and Away Teams
In [28]:
# Dictionary to store the results
home_away_counts = {"Home":{},"Away":{}}

for season in foul_df_clean["season"].unique():
    home_season_inc = foul_df_clean[(foul_df_clean["player team"] == foul_df_clean["home"]) & (foul_df_clean["season"] == season)]
    away_season_inc = foul_df_clean[(foul_df_clean["player team"] == foul_df_clean["away"]) & (foul_df_clean["season"] == season)]    
    home_season_fav = foul_df_clean[(foul_df_clean["opponent team"] == foul_df_clean["home"]) & (foul_df_clean["season"] == season)]
    away_season_fav = foul_df_clean[(foul_df_clean["opponent team"] == foul_df_clean["away"]) & (foul_df_clean["season"] == season)]
    
    # Count observations where correct decision is 0 (incorrect) for the home team
    count_incorrect_home = len(home_season_inc[(home_season_inc["correct decision"] == 0)])
    # Do the same for the away team
    count_incorrect_away = len(away_season_inc[(away_season_inc["correct decision"] == 0)])
    
    # Count observations where correct decision is 0 (incorrect) but favors the home team
    count_favorable_home = len(home_season_fav[(home_season_fav["correct decision"] == 0)])
    # Do the same for the away team
    count_favorable_away = len(away_season_fav[(away_season_fav["correct decision"] == 0)])
    
    # Store the counts for the current season and move onto the next season
    # The counts are represented as a tuple (# incorrect, # favorable incorrect)
    home_away_counts["Home"][season] = (count_incorrect_home, count_favorable_home)
    home_away_counts["Away"][season] = (count_incorrect_away, count_favorable_away)

for team, season_counts in home_away_counts.items():
    print(f"{team}")
    for season, (inc_count, fav_count) in season_counts.items():
        print(f"{season} Season: {inc_count} incorrect calls, {fav_count} favorable calls")
Home
2015 Season: 127 incorrect calls, 87 favorable calls
2016 Season: 388 incorrect calls, 264 favorable calls
2017 Season: 267 incorrect calls, 209 favorable calls
2018 Season: 313 incorrect calls, 228 favorable calls
2019 Season: 371 incorrect calls, 325 favorable calls
2020 Season: 259 incorrect calls, 293 favorable calls
2021 Season: 235 incorrect calls, 244 favorable calls
2022 Season: 349 incorrect calls, 322 favorable calls
2023 Season: 365 incorrect calls, 334 favorable calls
2024 Season: 227 incorrect calls, 254 favorable calls
Away
2015 Season: 125 incorrect calls, 95 favorable calls
2016 Season: 369 incorrect calls, 304 favorable calls
2017 Season: 320 incorrect calls, 185 favorable calls
2018 Season: 280 incorrect calls, 248 favorable calls
2019 Season: 351 incorrect calls, 343 favorable calls
2020 Season: 296 incorrect calls, 255 favorable calls
2021 Season: 248 incorrect calls, 228 favorable calls
2022 Season: 328 incorrect calls, 346 favorable calls
2023 Season: 338 incorrect calls, 365 favorable calls
2024 Season: 254 incorrect calls, 226 favorable calls
Plot the Data

Okay, now that we have counted up calls for every team across every season, we can use our data to plot incorrect and favorable call counts over time. We can also calculate the league-average number of incorrect and favorable calls for every season using the mean() function once again. Follow the code below and take a look at the results; you might notice something interesting!

Interestingly enough, there seem to be some significant team-level differences in call frequency. However, these trends shift quite a bit from season to season. It may be hard to see, but take a look at the Nuggets' number of favorable calls relative to the mean line. In 2023, the year they won the Finals, the Nuggets were actually far above average in terms of favorable calls. In contrast, the Warriors fell below the average number of favorable calls in 2022, when they won the championship over the Celtics. Thus we might not be able to make the generalization that some teams receive better treatment from the refs than other teams, but we can keep exploring other factors! If you want to feel better about the team you support, see how they compare to the rest of the league by making some minor changes to the code below.

In [20]:
# Create a dataframe from the team_season_incorrect_counts dictionary and transpose 
team_season_incorrect_df = pd.DataFrame(team_season_incorrect_counts)
team_season_incorrect_df = team_season_incorrect_df.T

# Create a dataframe from the team_season_favorable_counts dictionary and transpose 
team_season_favorable_df = pd.DataFrame(team_season_favorable_counts)
team_season_favorable_df = team_season_favorable_df.T

# Calculate the average number of incorrect and favorable incorrect calls per season
avg_incorrect_calls = team_season_incorrect_df.mean(axis=0)
avg_favorable_calls = team_season_favorable_df.mean(axis=0)

# Plotting
fig, axs = plt.subplots(2, 1, figsize=(12, 14))

# Plot for number of incorrect calls against each team over seasons
for team in team_season_incorrect_df.index:
    axs[0].plot(team_season_incorrect_df.columns, team_season_incorrect_df.loc[team], label=team)

# Plot the average line for incorrect calls
axs[0].plot(team_season_incorrect_df.columns, avg_incorrect_calls, label="Average", color="black", linewidth=2, linestyle="--")

axs[0].set_title("Number of Incorrect Calls Against Each Team Over Seasons")
axs[0].set_xlabel("Season")
axs[0].set_ylabel("Number of Incorrect Calls")
axs[0].grid(True)
axs[0].legend(loc="center left", bbox_to_anchor=(1, 0.5))

# Plot for number of favorable incorrect calls for each team over seasons
for team in team_season_favorable_df.index:
    axs[1].plot(team_season_favorable_df.columns, team_season_favorable_df.loc[team], label=team)

# Plot the average line for favorable calls
axs[1].plot(team_season_favorable_df.columns, avg_favorable_calls, label="Average", color="black", linewidth=2, linestyle="--")

axs[1].set_title("Number of Favorable Incorrect Calls For Each Team Over Seasons")
axs[1].set_xlabel("Season")
axs[1].set_ylabel("Number of Favorable Incorrect Calls")
axs[1].grid(True)
axs[1].legend(loc="center left", bbox_to_anchor=(1, 0.5))

plt.tight_layout()
plt.show()
Figure: Incorrect Calls Against Each Team and Favorable Incorrect Calls For Each Team Over Seasons (two line plots)
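If you want to see how your favorite team stacks up, one possible tweak (using the Knicks as an arbitrary example and the dataframes built above) is to plot a single team against the league average:

# Highlight one team against the league average for favorable incorrect calls
my_team = "Knicks"  # swap in any team name from the dataset
plt.figure(figsize=(10, 5))
plt.plot(team_season_favorable_df.columns, team_season_favorable_df.loc[my_team], marker="o", label=my_team)
plt.plot(team_season_favorable_df.columns, avg_favorable_calls, linestyle="--", color="black", label="League Average")
plt.title(f"Favorable Incorrect Calls: {my_team} vs. League Average")
plt.xlabel("Season")
plt.ylabel("Number of Favorable Incorrect Calls")
plt.legend()
plt.grid(True)
plt.show()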

Since we already counted calls for the home and away team columns, we can plot them as well. Overall, the away team received more favorable calls in the majority of seasons: 2015-2016, 2018-2019, and 2022-2023.

In [13]:
# Get number of favorable and incorrect calls by season for home and away teams
seasons = sorted(list(home_away_counts["Home"].keys()))
home_incorrect_counts = [home_away_counts["Home"][season][0] for season in seasons]
away_incorrect_counts = [home_away_counts["Away"][season][0] for season in seasons]
home_favorable_counts = [home_away_counts["Home"][season][1] for season in seasons]
away_favorable_counts = [home_away_counts["Away"][season][1] for season in seasons]

# Plotting
fig, axs = plt.subplots(2, 1, figsize=(12, 10))

# Plot for number of incorrect calls over seasons for the home and away teams
axs[0].plot(seasons, home_incorrect_counts, label="Home")
axs[0].plot(seasons, away_incorrect_counts, label="Away")
axs[0].set_title("Number of Incorrect Calls Over Seasons for Home and Away Teams")
axs[0].set_xlabel("Season")
axs[0].set_ylabel("Number of Incorrect Calls")
axs[0].legend()
axs[0].grid(True)

# Plot for number of favorable calls over seasons for the home and away teams
axs[1].plot(seasons, home_favorable_counts, label="Home")
axs[1].plot(seasons, away_favorable_counts, label="Away")
axs[1].set_title("Number of Favorable Calls Over Seasons for Home and Away Teams")
axs[1].set_xlabel("Season")
axs[1].set_ylabel("Number of Favorable Calls")
axs[1].legend()
axs[1].grid(True)

plt.tight_layout()
plt.show()
Figure: Incorrect Calls and Favorable Calls Over Seasons for Home and Away Teams (two line plots)

Of course, not every referee is the same when it comes to accuracy; they aren't robots! To visualize this, we can count and plot the number of incorrect calls for every unique official listed in the dataset. This is not enough on its own, though, because some refs have not officiated as many games and are therefore unlikely to get the same number of calls wrong as the more experienced referees. To account for this, we can also keep track of how many standard deviations each referee is from the average accuracy (proportion of correct calls to total calls). Using this method will allow us to level the playing field a bit and analyze the performance of each referee. You might notice that there is quite a bit of variability in accuracy across officials. Interesting!

Plot the Number of Incorrect Calls and Call Accuracy by Official
In [17]:
# Group by official and count the total number of calls made by each official
total_calls_by_official = foul_df_clean.groupby("official").size()

# Then get all of the incorrect calls
incorrect_calls_df = foul_df_clean[foul_df_clean["correct decision"] == 0]

# Group by official and count their incorrect calls
incorrect_calls_by_official = incorrect_calls_df.groupby("official").size()

# Calculate accuracy (proportion of correct calls) for each official
accuracy_by_official = (total_calls_by_official - incorrect_calls_by_official) / total_calls_by_official

# Calculate mean call accuracy and standard deviation
mean_accuracy = accuracy_by_official.mean()
std_accuracy = accuracy_by_official.std()

# Calculate number of standard deviations from the mean (z-score) for each official
z_score = (accuracy_by_official - mean_accuracy) / std_accuracy

# Sort officials based on their total number of incorrect calls
incorrect_calls_by_official = incorrect_calls_by_official.sort_values()

# Sort officials based on number of standard deviations from the mean accuracy
z_score = z_score.sort_values()

# Create a figure and two subplots
fig, axes = plt.subplots(2, 1, figsize=(10, 12))

# Plot total incorrect calls by official
axes[0].bar(incorrect_calls_by_official.index, incorrect_calls_by_official, color="lightblue")
axes[0].set_title("Total Incorrect Calls by Official")
axes[0].set_xlabel("Official")
axes[0].set_ylabel("Number of Incorrect Calls")
axes[0].tick_params(axis="x", rotation=90)
axes[0].tick_params(axis="x", labelsize=6)  # Adjust label size

# Plot number of standard deviations from mean accuracy by official
axes[1].bar(z_score.index, z_score, color="orange")
axes[1].axhline(0, color="gray", linestyle="--", linewidth=0.5)  # Add horizontal line at y=0
axes[1].set_title("Number of Standard Deviations from Mean Accuracy by Official")
axes[1].set_xlabel("Official")
axes[1].set_ylabel("Standard Deviations from Mean Accuracy")
axes[1].tick_params(axis="x", rotation=90)
axes[1].tick_params(axis="x", labelsize=6)  # Adjust label size

plt.tight_layout()
plt.show()
Figure: Total Incorrect Calls by Official and Standard Deviations from Mean Accuracy by Official (two bar charts)

The last two plots in this section are much easier to complete. To plot the frequency of incorrect calls as the number of seconds remaining decreases, we can just remove observations where the number of seconds is above 120, or 2 minutes. To plot the accuracy of calls for each type of foul, we can use the groupby() and mean() functions.

Based on the charts, it seems like referees tend to get more calls wrong as game time decreases. This can be especially concerning for close games! Also, the correctness of calls appears to shift significantly across foul types, perhaps due to some calls being more common than others.

Plot the Number of Incorrect Calls Over Remaining Game Time
In [21]:
# Keep only the rows from the last two minutes of game time
l2m = foul_df_clean[foul_df_clean["seconds"] <= 120]

# Keep only the incorrect calls from the last two minutes
incorrect_calls_l2m = l2m[l2m["correct decision"] == 0]

# Plotting
plt.figure(figsize=(8, 6))
incorrect_calls_l2m["seconds"].hist(bins=24, color="skyblue", edgecolor="black")
plt.xlabel("Seconds Remaining")
plt.ylabel("Number of Incorrect Calls")
plt.title("Number of Incorrect Calls in the Last Two Minutes")
plt.grid(axis="y")
plt.show()
Figure: Number of Incorrect Calls in the Last Two Minutes (histogram)
Plot the Accuracy of Calls for Each Type of Foul Call
In [251]:
# Calculate accuracy for each type of call
accuracy_by_type = foul_df_clean.groupby("type")["correct decision"].mean().sort_values()

# Plot the accuracy for each type of call
plt.figure(figsize=(10, 6))
sns.barplot(x=accuracy_by_type.index,y=accuracy_by_type.values)
plt.title("Accuracy of Calls by Type")
plt.xlabel("Type of Call")
plt.ylabel("Accuracy")
plt.tick_params(axis="x", rotation=90, labelsize=6)
plt.show()
Figure: Accuracy of Calls by Type (bar chart)

With all of the exploration complete, I think we are ready to start coming up with some possible hypotheses.

Model Analysis, Hypothesis Testing, Machine Learning

At this stage of the data science pipeline, we will use a few well-known modeling and machine learning techniques to see how valid our assumptions are. To be more specific, we will train logistic regression and random forest models on our cleaned data and see how useful certain variables like official and remaining game time are for predicting incorrect foul calls. We will explain the different models later on, but first we have to prepare our dataset for training.

We will use the same independent (predictors) and dependent (response) variables for both models. Independent: season, seconds remaining, foul type, official. Dependent: correct decision.

If you try to split the dataset and train a model with it as-is, you will likely face some errors. This is because official names and foul types are represented as strings, which you cannot perform mathematical operations on! Fixing this is pretty simple with one-hot encoding, a method that converts categorical variables into unique sequences of 1s and 0s. For example, we can map the values [Cat, Dog, Fish] to [[1,0,0], [0,1,0], [0,0,1]]. If you want to learn more about one-hot encoding, you can read about it here. In Python, all we have to do is run the Pandas get_dummies() function to one-hot encode the season, type, and official columns. Even though seasons are technically in numeric format, we will view them as categorical values.
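To see what get_dummies() actually produces, here is a toy example with made-up values:

# One-hot encoding a small categorical column
pets = pd.Series(["Cat", "Dog", "Fish", "Dog"])
print(pd.get_dummies(pets))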

While we are at it, we can standardize the seconds column by calculating its mean and standard deviation and using the formula: $$Z = \frac{x - \textrm{mean}}{\textrm{standard deviation}}$$ Standardization rescales the column so it has a mean of 0 and a standard deviation of 1, putting it on a scale comparable to the one-hot encoded features so that no single feature dominates the model simply because of its units (the number of seconds remaining can vary quite a bit).

Preprocess Data
In [22]:
# Create a new dataset which includes the features we want to test on
# Our independent variables (predictors) are season, seconds remaining, type of foul called, and the official
# Our dependent variable (response) is the correctness of the decision
foul_df_final = foul_df_clean[["season", "seconds", "type", "official", "correct decision"]].copy()

# Standardize the seconds column
mean_seconds = np.mean(foul_df_final["seconds"])
std_seconds = np.std(foul_df_final["seconds"])
foul_df_final["seconds"] = (foul_df_final["seconds"] - mean_seconds) / std_seconds

# One hot encode seasons and official names
foul_df_final = pd.get_dummies(foul_df_final, columns=["season", "type", "official"], prefix=["season", "type", "official"])

foul_df_final.head()
Out[22]:
seconds correct decision season_2015 season_2016 season_2017 season_2018 season_2019 season_2020 season_2021 season_2022 ... official_Suyash Mehta official_Tom Washington official_Tony Brothers official_Tre Maddox official_Tyler Ford official_Tyler Mirkovich official_Tyler Ricks official_Violet Palmer official_Vladimir Voyard-Tadal official_Zach Zarba
0 1.452359 1 True False False False False False False False ... False False False False False False False False False False
1 1.230863 1 True False False False False False False False ... False False False False False False False False False False
2 0.960147 1 True False False False False False False False ... False False False False False False False False False False
3 0.394103 1 True False False False False False False False ... False False False False False False False False False False
4 0.000334 1 True False False False False False False False ... False False False False False False False False False False

5 rows × 179 columns

Checking for and Handling Target Class Imbalance
In [23]:
# Count the occurrences of each value in the correct decision column
correct_decision_counts = foul_df_final["correct decision"].value_counts()
print(correct_decision_counts)
correct decision
1    74543
0     5837
Name: count, dtype: int64

Oh no, it looks like our dataset is imbalanced! Class imbalance occurs when the minority class, which is "correct decision == 0" in this case, appears far less often than the majority class. There are over 12 times as many correct decisions as there are incorrect decisions. There are a few ways to deal with this, but the easiest solution is to pass a certain parameter when running our machine learning models (more on this soon). Before that, we can split our data into training and test sets with the train_test_split function.

Split Data into Train and Test Splits
In [26]:
# Prepare the data for a train test split
# Our predictors are season, seconds remaining, foul type, and official
# Our response is a correct decision (1) or incorrect decision (0)
X = foul_df_final.drop(columns=["correct decision"])  # All columns except the target column
y = foul_df_final["correct decision"]  # Target column
# Test size = the proportion of the dataset to include in the test split
# Random state = controls the shuffling of the data before the split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

Logistic regression is used for predicting binary outcomes like incorrect and correct foul calls. It uses the sigmoid function to turn a weighted combination of our independent variables into a probability between 0 and 1, and that probability is compared to a threshold (typically 0.5) to make a prediction. All we have to do is give the LogisticRegression model a maximum number of iterations and the parameter class_weight = "balanced", which tells it to compensate for our class imbalance by weighting the rarer class more heavily. In the background, the model will use our training data to find the coefficients for our independent variables which best estimate the training outputs. Once the model is trained, we can assess its performance by making predictions on the test set with the predict() function. We can also see which variables are most influential by printing the five coefficients with the largest absolute values. From this we can gather that the SUPPORT RULING, OVERTURN RULING, and PERSONAL TAKE call types have a strong influence on the model's predictions.
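For reference, the sigmoid squashes the weighted sum of the predictors, $$z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n,$$ into a probability between 0 and 1: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$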

Logistic Regression: Predicting/Classifying Incorrect Calls
In [235]:
# Initialize and train the model
model = LogisticRegression(max_iter=2000, class_weight="balanced")
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Get the coefficients for the predictors
coefficients = model.coef_[0]
top_indices = np.argsort(np.abs(coefficients))[::-1][:5]
for index in top_indices:
    feature = X_train.columns[index]
    coefficient = coefficients[index]
    print(f"Feature: {feature}, Coefficient: {coefficient}")
Feature: type_SUPPORT RULING, Coefficient: 4.351317369954331
Feature: type_OVERTURN RULING, Coefficient: 3.5367877183853538
Feature: type_PERSONAL TAKE, Coefficient: 3.358119584102903
Feature: type_OUT OF BOUNDS - BAD PASS TURN, Coefficient: 2.58783960803185
Feature: official_Sha'Rae Mitchell, Coefficient: 2.2084677967373034

To evaluate the model's performance, we can use the following metrics:

  • Accuracy = the overall percentage of correct predictions.
  • Precision = the percentage of positive predictions that are actually correct.
  • Recall = the percentage of actual positives that are correctly identified.
  • F1 = the harmonic mean of precision and recall.
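In terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these metrics can be written as: $$\textrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \textrm{Precision} = \frac{TP}{TP + FP}, \quad \textrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \textrm{Precision} \cdot \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}$$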

For visualization, we can plot a confusion matrix, which is a table showing the number of true/false positives and true/false negatives. A true positive is a correct prediction of the positive class, a false positive is an incorrect prediction of the positive class, a true negative is a correct prediction of the negative class, and a false negative is an incorrect prediction of the negative class. Our logistic regression has a massive number of true positives compared to everything else, mostly because there are far more positive examples than negative examples in our dataset. To further reduce the effect of class imbalance, we could reduce the number of positive examples in our dataset (undersampling) or use an alternative method like oversampling; a rough sketch of the latter follows.
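For reference, here is one way random oversampling could be sketched with scikit-learn's resample utility; we do not actually retrain on this balanced set in the tutorial, it is just an illustration of the idea:

from sklearn.utils import resample

# Recombine the training features and labels so we can resample whole rows
train = X_train.copy()
train["correct decision"] = y_train

minority = train[train["correct decision"] == 0]  # incorrect calls
majority = train[train["correct decision"] == 1]  # correct calls

# Randomly duplicate minority rows until both classes are the same size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])

X_train_bal = balanced.drop(columns=["correct decision"])
y_train_bal = balanced["correct decision"]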

Finally, we can plot a ROC curve that shows the tradeoff between the model's True Positive and False Positive Rates (TPR and FPR). Without going into too much detail, a curve that is closer to the top-left corner is generally considered better than a diagonal line. If the curve is a diagonal line, it means our model's predictions are no better than random guesses. The area under the curve, or AUC, summarizes the performance of our model in a single value between 0 and 1. An AUC of 1 indicates a perfect model, while an AUC at or below 0.5 means the model is no better than guessing. Thankfully, we achieved an AUC of 0.69. Not perfect by any means, but not meaningless either! For more information on these metrics, go here.

Testing our Model
In [236]:
# Evaluate the model using a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot confusion matrix as a heatmap
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="g", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Calculate recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)

# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)
Figure: Confusion Matrix (Logistic Regression)
Accuracy: 0.7515551132122419
Precision: 0.9500537501033656
Recall: 0.7721621076685261
F1 Score: 0.8519205101586831
Plotting the ROC Curve
In [237]:
# Get predicted probabilities for the positive class
y_probs = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs, pos_label=1)

# Calculate AUC (Area Under the Curve)
auc = roc_auc_score(y_test, y_probs)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="blue", label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], color="gray", linestyle="--", lw=1)
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
Figure: ROC Curve for the Logistic Regression (AUC = 0.69)

Let's move on and compare our logistic regression to a random forest model. The random forest classifier is an ensemble method that combines the predictions of multiple decision trees to produce a more accurate prediction than any single decision tree on its own. Usually, the final prediction is made by taking the majority vote of the trees. Decision trees split the input data into branches based on feature/predictor values and produce predictions at the leaf level. In a random forest, the variable used to split the data at any given node is chosen from a random subset of the predictors (like official and seconds, or season and call type), which helps the trees differ from one another. You can learn about the model in more detail here.

After calculating the same metrics, we can see a few things:

  • The F1 score is higher than the logistic regression's. This means the random forest model is better at identifying correct foul calls (the positive class).
  • The AUC is slightly lower than the logistic regression's. This means the random forest model is slightly worse at distinguishing between incorrect and correct foul calls.
  • The most important feature is not a specific call type; it is the number of seconds left in the game.
Comparing the Results to a Random Forest Classifier
In [25]:
# Initialize and train the Random Forest model
model_rf = RandomForestClassifier(random_state=42, class_weight="balanced")
model_rf.fit(X_train, y_train)

# Predictions
y_pred_rf = model_rf.predict(X_test)

# Evaluate the model using a confusion matrix
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)

# Plot confusion matrix as a heatmap
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix_rf, annot=True, cmap="Blues", fmt="g", cbar=False)
plt.title("Confusion Matrix (Random Forest)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Calculate accuracy, precision, recall, and F1-score
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
print("Accuracy (Random Forest):", accuracy_rf)
print("Precision (Random Forest):", precision_rf)
print("Recall (Random Forest):", recall_rf)
print("F1 Score (Random Forest):", f1_rf)

# Get predicted probabilities for the positive class
y_probs_rf = model_rf.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_probs_rf, pos_label=1)
auc_rf = roc_auc_score(y_test, y_probs_rf)

# Plot ROC curve
plt.figure(figsize=(6, 4))
plt.plot(fpr_rf, tpr_rf, color="blue", label=f"ROC Curve (AUC = {auc_rf:.2f})")
plt.plot([0, 1], [0, 1], color="gray", linestyle="--", lw=1)
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

# Get the feature importances
feature_importances_rf = model_rf.feature_importances_

# Sort the features based on their importance
sorted_indices_rf = np.argsort(feature_importances_rf)[::-1]
top_features_rf = X_train.columns[sorted_indices_rf][:5]
top_importances_rf = feature_importances_rf[sorted_indices_rf][:5]

# Plot the top 5 features
plt.figure(figsize=(6, 4))
plt.barh(top_features_rf, top_importances_rf, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 5 Feature Importances")
plt.gca().invert_yaxis()  
plt.show()
Figure: Confusion Matrix (Random Forest)
Accuracy (Random Forest): 0.8903956208011943
Precision (Random Forest): 0.9306586118589533
Recall (Random Forest): 0.9525505746353922
F1 Score (Random Forest): 0.9414773482130995
Figure: ROC Curve for the Random Forest
Figure: Top 5 Feature Importances (Random Forest)
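One last optional check: we imported cross_val_score at the start but never used it. If you want some reassurance that these numbers are not a fluke of a single train/test split, you could cross-validate the random forest; a minimal sketch (this can take a while on the full dataset):

# 3-fold cross-validated F1 score for the random forest
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    X, y, cv=3, scoring="f1"
)
print(cv_scores, cv_scores.mean())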

Insights

Chris Paul and Clippers Head Coach Doc Rivers Arguing with Referee Scott Foster
Source: Yahoo Sports

This part of the data lifecycle is where we try to draw conclusions from our analysis and modeling. So, what have we learned here? Well, we can say a few things:

  1. Since 2015, referees have gotten better at making foul calls at the end of games.
  2. Some teams get more favorable calls than others, but this is not consistent across seasons.
  3. Accuracy among officials can vary drastically.
  4. Certain foul types get incorrectly called a lot more than others, like double technical fouls. This makes sense as technical fouls are quite subjective and can be swayed by the emotions of a referee.
  5. Foul calls get increasingly inaccurate as time expires. Time appears to be the most important feature when it comes to predicting the correctness of a foul call.

Overall, it is safe to say that referees are not as bad as fans make them out to be. Although they get calls wrong, there are many factors outside of their control that make mistakes hard to avoid entirely. With that in mind, referees may still have biases that cannot be found in our dataset. For example, officials could treat star players better than benchwarmers, but there is no way of knowing that without having a player dataset on hand as well. The answers to our questions are still unclear, but we hope that you can take what you learned from this project to analyze other datasets!