ML model evaluation: Measuring the accuracy of ESPN Fantasy Football projections in Python

The new year brings many joys, one of the foremost being the NFL playoffs. However, the end of the regular season also marks the end of the fantasy football season, the end of hilarious memes, and the end of fantasy league smack talk until September. This year, as manager of a non-PPR league, I tried to make things more memorable, and I decided to spice things up the best way I knew how: data science.


Lineup decisions are difficult for any fantasy football player. While I am more than a casual football fan, I am also a data-driven expected value thinker. I don’t presume to beat the stock market or know more than the odds makers in Vegas, so I also don’t presume to know more than the ESPN projected scores in fantasy football. When I make my lineup decisions, I therefore aim to maximize my projected score. However, this year, I got off to a rocky 1–4 start in the league. I began to question my strategy of relying on the ESPN projections.

I wanted to know: how accurate is the ESPN fantasy football projected score? In this blog, I will outline my data science approach to answering this exact question. I will use methods to assess the accuracy of the model as if it were a regression model and then as if it were a binary classification model.

Data

I use data from my own fantasy football league. We had a 10-team league and played an 18-week season, so there are 180 pairs of projected and actual scores from around the league. For the sake of anonymity, I changed the team names to team_1, team_2, …, team_10. Additionally, I treated playoff games the same as regular season games. In our league, each playoff game was two weeks long; however, for the sake of this analysis, I treated each playoff week as a separate game. Also, our league did no stat corrections for the suspended/canceled Bills-Bengals game from Week 17.

We can read the data set I created as a pandas data frame and return the first five rows using the head() function. There are columns for the week of the season, the team, the team's projected score, the team's actual score, the difference between the projected and actual score, and the opposing team. Here is what that looks like:

import numpy as np
import pandas as pd
fantasy_football_df = pd.read_csv('./Fantasy_Football_Proj_Actual.csv')
fantasy_football_df.head()
[Figure: first five rows of the fantasy football data frame returned by head()]

Comparing Projected and Actual Scores: A Regression Model Approach

Now that we have the data loaded, we will treat the projected score as the output from a regression model. In machine learning, a regression model is a form of supervised learning model which aims to predict a continuous numeric value. We know that the ESPN overall projected score is a sum of the projected scores of the individual players, so it fits this mold.

First, let’s get summary statistics of the projected and actual scores as well as the difference between the two:

fantasy_football_df.loc[:,['Projected', 'Actual', 'Delta_proj_minus_actual']].describe()
[Figure: summary statistics (describe) for the Projected, Actual, and Delta_proj_minus_actual columns]

There are a few immediate important takeaways:

  1. The mean of the projected scores is slightly higher than that of the actual scores meaning that, in expectation, the underlying regression model overpredicts scores.
  2. The standard deviation of the actual scores is much higher than that of the projected scores. This is expected because the projected scores are a sum of many expected values.
  3. In a similar vein to point 2, we see that the min and max actual scores are more extreme than the min and max projected scores.

Using plotly, we can visualize the distribution of the data:

import plotly.figure_factory as ff

# Group data together
hist_data = [fantasy_football_df['Projected'].values, fantasy_football_df['Actual'].values]

group_labels = ['Projected', 'Actual']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=5)
fig.show()

[Figure: distribution plot of projected vs. actual scores]

The figure is incredibly informative because we can see a histogram, a smoothed density curve, and a rug plot of the individual scores all in one. The difference in distribution shape is striking and confirms that there is much more randomness in actual fantasy scores than in the projected scores. Also, the actual scores trend slightly lower than the projected scores. The difference in variance should not surprise fantasy football players: from week to week, players can overachieve, or they can get injured five minutes into a game and never see another snap. After all, that is why ESPN also provides boom and bust stats for individual players. Predicting when the booms and busts occur would be much more difficult, but in expectation, the distributions match pretty closely.

Using the distfit package in Python, we can fit and compare many different distributions to match the Actual and Projected data. (More on that in my post here.) For this blog post, I will skip to the results. The ESPN projected scores fit incredibly well to a loggamma distribution:

[Figure: distfit results fitting the projected scores to a loggamma distribution]

Meanwhile, the actual scores fit well to a generalized extreme value distribution:

[Figure: distfit results fitting the actual scores to a generalized extreme value distribution]
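For reference, here is a minimal sketch of how those fits could be reproduced with distfit (assuming the package is installed; by default it searches a set of common parametric distributions and keeps the best fit):

from distfit import distfit

# Fit candidate distributions to the projected and actual scores separately
for col in ['Projected', 'Actual']:
    dfit = distfit()
    dfit.fit_transform(fantasy_football_df[col].values)
    print(col, 'best fit:', dfit.model['name'])
    dfit.plot()  # overlay the best-fitting distribution on the histogram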

We can also plot the difference between projected and actual score:

import plotly.figure_factory as ff

# Group data together
hist_data = [fantasy_football_df['Delta_proj_minus_actual'].values]

group_labels = ['Projected - Actual']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=5)
fig.show()

[Figure: distribution plot of projected minus actual scores]

Here, we see the distribution is left skewed. The several outliers on the left side of the graph are weeks in which a team blew well past its projection, that is, weeks where the model severely underpredicted the actual score. Overall, the message is that the projected score seems to be a decent predictor for purposes of setting your lineup and can generally be trusted to within a +/- 20-point range.
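To put a rough number on that claim, we can check what share of team-weeks actually landed within 20 points of the projection (a quick sketch using the existing delta column):

within_20 = (fantasy_football_df['Delta_proj_minus_actual'].abs() <= 20).mean()
print('Share of team-weeks within +/- 20 points of the projection: ' + str(round(within_20*100, 1)) + '%')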

Common Regression Accuracy Metrics

We can visualize the error using scatter plots as well. For example, here we plot projected versus actual scores for every team-week of the season, with each point colored by its raw error (projected minus actual) and a black reference line where projected equals actual.

fantasy_football_df['Raw Error'] = fantasy_football_df['Delta_proj_minus_actual']

import plotly.express as px
import plotly.graph_objects as go
# Create figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.linspace(50, 200), y=np.linspace(50, 200), mode='lines',
                         line=dict(color="black"), showlegend=False))
fig.add_trace(go.Scatter(x=fantasy_football_df['Actual'], y=fantasy_football_df['Projected'],
                         name='Proj - Actual', mode='markers',
                         marker=dict(size=8, color=fantasy_football_df['Raw Error'],
                                     colorscale='Jet', showscale=True)))
fig.update_layout(xaxis=go.layout.XAxis(title=go.layout.xaxis.Title(text="Actual")),
                  yaxis=go.layout.YAxis(title=go.layout.yaxis.Title(text="Projected")))

fig.show()

[Figure: scatter plot of projected vs. actual scores, colored by raw error]

From the plot, we can confirm that the standard deviation of actual scores was much greater than that of projected scores. There were several weeks in which the projected score was close to perfect, but a majority of the projected scores were overpredictions.

Now, we need a way to quantify the accuracy of the ESPN projected score model over all predictions. There are several canonical metrics including Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percent Error (MAPE).

RMSE = sqrt(mean((projected - actual)^2))
MAE = mean(|projected - actual|)
MAPE = mean(|projected - actual| / actual) * 100%

We can easily calculate the error metrics and print them in Python:

# Absolute error is needed for MAE and MAPE
fantasy_football_df['Absolute Error'] = fantasy_football_df['Raw Error'].abs()

RMSE = round(np.sqrt(np.mean(fantasy_football_df['Raw Error']**2)), 2)
MAE = round(np.mean(fantasy_football_df['Absolute Error']), 2)
MAPE = round(np.mean(fantasy_football_df['Absolute Error']/fantasy_football_df['Actual'])*100, 2)
print('ESPN Projected Score Accuracy:\nRMSE = '+str(RMSE)+'\nMAE = '+str(MAE)+'\nMAPE = '+str(MAPE)+'%')
[Output: printed RMSE, MAE, and MAPE values for the ESPN projections]

These metrics allow us to describe the performance of the model in both absolute and relative terms. While the original histograms showed a mean error of only about 2 points, that number nets out over- and underpredictions. The RMSE, MAE, and MAPE, by contrast, treat over- and underpredictions the same: the mean of errors -10 and 10 is 0, whereas their MAE (or RMSE) is 10. We can see that, on average, the ESPN projection model misses by around 18-23 points, which is about 18% above or below the actual score.
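If you prefer library functions over hand-rolled formulas, scikit-learn offers equivalents (a sketch, assuming scikit-learn is installed; the numbers should match the values above up to rounding):

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error

rmse_check = np.sqrt(mean_squared_error(fantasy_football_df['Actual'], fantasy_football_df['Projected']))
mae_check = mean_absolute_error(fantasy_football_df['Actual'], fantasy_football_df['Projected'])
mape_check = mean_absolute_percentage_error(fantasy_football_df['Actual'], fantasy_football_df['Projected'])*100  # sklearn returns a fraction
print('RMSE = '+str(round(rmse_check, 2))+'\nMAE = '+str(round(mae_check, 2))+'\nMAPE = '+str(round(mape_check, 2))+'%')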

Evaluating Error by Week

As anyone who has ever played fantasy knows, some weeks it feels like every roster in the league is booming, while other weeks bring busts across the board. So, I wanted to know whether the data from my league matched my gut feeling.

Using a box plot, we can visualize the quartiles of the data. The box represents the interquartile range (25–75th percentile), and the middle line of each box represents the median.

import plotly.express as px
import plotly.graph_objects as go
# Create figure
fig = px.box(fantasy_football_df, x="Week", y="Raw Error", color = 'Week')
fig.add_trace(go.Scatter(x=fantasy_football_df['Week'], y=np.zeros(len(fantasy_football_df['Week'])), mode='lines',
                         line=dict(color="black"), showlegend=False))
fig.show()
[Figure: box plot of projected minus actual score by week]

There are several weeks (2, 9, 12, 15) where the median difference between projected and actual scores is nearly zero. Meanwhile, weeks 1, 3, 14, 17, and 18 show overprediction across the league by ESPN's model, and week 8 shows large underprediction. For most weeks, zero error falls within the interquartile range of the league's errors (in other words, the black line on the graph passes through the box).
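The same pattern can be read off numerically with a quick groupby (a sketch; the weeks singled out above should stand out in the output):

# Median, mean, and spread of (projected - actual) by week
weekly_error = fantasy_football_df.groupby('Week')['Raw Error'].agg(['median', 'mean', 'std'])
print(weekly_error.round(2))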

Evaluating Error by Team

Perhaps the most anticipated insight I provided to my league each week was a boxplot showing the distribution of the difference between projected and actual scores by team. I highly recommend this simple plot for your league next year. Over the course of the season, most people’s median error grew closer and closer to zero, but by tracking the plot week over week, it was easy to get a sense of who was caught in a boom-bust cycle and who was more consistent. This can be a great visualization to help all the teams in your league assess whether their lineup strategy is performing as expected. Here’s the plotly code:

import plotly.express as px
import plotly.graph_objects as go
# Create figure
fig = px.box(fantasy_football_df, x="Team", y="Raw Error", color = 'Team')
fig.add_trace(go.Scatter(x=fantasy_football_df['Team'], y=np.zeros(len(fantasy_football_df['Team'])), mode='lines',
                         line=dict(color="black"), showlegend=False))
fig.show()
[Figure: box plot of projected minus actual score by team]

As you can see from the plot, only one team consistently outperformed its projections: team_2, the eventual league champion. The most predictable team was team_7, while the least predictable team was team_3. Of course, while the projected score can be useful for lineup decisions, what ultimately matters in fantasy football is winning.
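Before moving on to win/loss prediction, here is a quick per-team summary that backs up the box plot numerically (a sketch; the median captures bias while the interquartile range captures consistency):

# Median error (bias) and interquartile range (consistency) of projected - actual, per team
def iqr(s):
    return s.quantile(0.75) - s.quantile(0.25)

team_error = fantasy_football_df.groupby('Team')['Raw Error'].agg(['median', iqr])
print(team_error.round(2).sort_values('median'))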

Evaluating ESPN Projected Score as a Classification Model

When playing fantasy football, what matters most is winning. Indirectly, the ESPN projected score serves as a classification model by comparing two projected scores and predicting the likely winner. Therefore, we will take this opportunity to evaluate the projections as a binary classification model predicting W/L for each team.


First, I start fresh in a new notebook and reload the data, adding dummy columns to track some Vegas-style results: straight up (S/U), comparing the projected W/L result against the actual W/L result; against the spread (ATS), comparing the projected margin between the two teams in a game against the actual margin; and the over/under (O/U), comparing the sum of the two teams' projected scores to the sum of their actual scores.

import numpy as np
import pandas as pd
fantasy_football_df = pd.read_csv('./Fantasy_Football_Proj_Actual.csv')

#Set up new columns
fantasy_football_df['Projected team_i_minus_team_j'] = 0.0
fantasy_football_df['Actual team_i_minus_team_j'] = 0.0
fantasy_football_df['Projected Result'] = 'L'
fantasy_football_df['Actual Result'] = 'L'
fantasy_football_df['ATS Result'] = 'L'
fantasy_football_df['Game Over Under'] = 'Under'

# For every row in the data, determine the projected, actual, ATS, and O/U result of that game
for i in range(len(fantasy_football_df)):
    week = fantasy_football_df['Week'].values[i]

    # Get projected and actual score of the first team
    proj_score_team_1 = fantasy_football_df['Projected'].values[i]
    actual_score_team_1 = fantasy_football_df['Actual'].values[i]

    # Get projected and actual score of the opponent
    opponent = fantasy_football_df['Opponent'].values[i]
    proj_score_team_2 = fantasy_football_df.loc[(fantasy_football_df['Week'] == week) & (fantasy_football_df['Team'] == opponent), 'Projected'].values[0]
    actual_score_team_2 = fantasy_football_df.loc[(fantasy_football_df['Week'] == week) & (fantasy_football_df['Team'] == opponent), 'Actual'].values[0]

    # Record the projected and actual margins
    fantasy_football_df.loc[i, 'Projected team_i_minus_team_j'] = proj_score_team_1 - proj_score_team_2
    fantasy_football_df.loc[i, 'Actual team_i_minus_team_j'] = actual_score_team_1 - actual_score_team_2

    # Update labels
    # S/U: projected and actual winners
    if (proj_score_team_1 - proj_score_team_2) > 0:
        fantasy_football_df.loc[i, 'Projected Result'] = 'W'

    if (actual_score_team_1 - actual_score_team_2) > 0:
        fantasy_football_df.loc[i, 'Actual Result'] = 'W'

    # ATS: a win against the spread means beating the projected margin
    if (proj_score_team_1 - proj_score_team_2) < (actual_score_team_1 - actual_score_team_2):
        fantasy_football_df.loc[i, 'ATS Result'] = 'W'

    # O/U: compare the combined actual score to the combined projected score
    if (proj_score_team_1 + proj_score_team_2) < (actual_score_team_1 + actual_score_team_2):
        fantasy_football_df.loc[i, 'Game Over Under'] = 'Over'

    if (proj_score_team_1 + proj_score_team_2) == (actual_score_team_1 + actual_score_team_2):
        fantasy_football_df.loc[i, 'Game Over Under'] = 'Line'
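Before plotting, it is worth a quick sanity check that the new label columns look sensible (a sketch; barring exact ties, S/U wins and losses should each total 90 across the 180 team-games):

# Count the labels in each of the new columns
for col in ['Projected Result', 'Actual Result', 'ATS Result', 'Game Over Under']:
    print(fantasy_football_df[col].value_counts(), '\n')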

Now, let’s compare each team’s performance over the season in a simple bar chart using plotly.

import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Projected Wins', x=list(np.unique(fantasy_football_df['Team'])),
           y=list(fantasy_football_df[fantasy_football_df['Projected Result'] == 'W'].groupby('Team').count()['Projected Result'].values)),
    go.Bar(name='Actual Wins', x=list(np.unique(fantasy_football_df['Team'])),
           y=list(fantasy_football_df[fantasy_football_df['Actual Result'] == 'W'].groupby('Team').count()['Actual Result'].values)),
    go.Bar(name='ATS Wins', x=list(np.unique(fantasy_football_df['Team'])),
           y=list(fantasy_football_df[fantasy_football_df['ATS Result'] == 'W'].groupby('Team').count()['ATS Result'].values)),
    go.Bar(name='Hits the Over', x=list(np.unique(fantasy_football_df['Team'])),
           y=list(fantasy_football_df[fantasy_football_df['Game Over Under'] == 'Over'].groupby('Team').count()['Game Over Under'].values))
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.add_trace(go.Scatter(x=list(np.unique(fantasy_football_df['Team'])),
                         y=np.zeros(len(list(np.unique(fantasy_football_df['Team'])))) + 9, mode='lines',
                         line=dict(color="black"), name='0.500', showlegend=True))
fig.show()

[Figure: grouped bar chart of projected wins, actual wins, ATS wins, and overs by team, with a reference line at a .500 record]

The ESPN projection model predicted only four teams to break .500: team_1, team_2, team_4, and team_10. The five teams that actually did break .500 look somewhat different: team_2, team_3, team_4, team_5, and team_8. So, despite a relatively accurate projected score, the projected game result was much less consistent. If you play fantasy, this should seem reasonable: each week, the difference between two teams' projected points (the spread, in Vegas terms) is usually in the single digits. Only team_1, team_5, team_7, and team_8 had winning records ATS, though three other teams went .500 ATS. Finally, as our earlier analysis suggested, hitting the Over was a rare occurrence league-wide (except for team_2), since the ESPN projections usually overpredicted a team's actual score.
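Before digging into per-team metrics, a league-wide confusion matrix of projected versus actual results gives a quick overall read (a sketch using pandas crosstab):

# Rows are the projected result, columns are the actual result, over all 180 team-games
confusion = pd.crosstab(fantasy_football_df['Projected Result'], fantasy_football_df['Actual Result'])
print(confusion)
print('Overall S/U accuracy: ' + str(round(np.diag(confusion.values).sum() / confusion.values.sum(), 3)))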

Common Classification Accuracy Metrics

Just like with the regression approach, let’s dig a little deeper into the accuracy of the ESPN projections as a classification model. There are several canonical metrics which we will explore: accuracy, recall, specificity, precision, and F score. Each of these metrics relies on identifying projected labels as true positive, true negative, false positive, or false negative. In this case a Win is considered positive, and a Loss is considered negative.

Accuracy is defined as the number of correct predictions over all predictions, recall (aka sensitivity) is the fraction of actual positives that are correctly identified, specificity is the fraction of actual negatives that are correctly identified, precision (aka positive predictive value) is the fraction of predicted positives that are truly positive, and the F score is the harmonic mean of precision and recall. Mathematically:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
F score = 2*TP / (2*TP + FP + FN)

We can track all of these metrics for each team in the league to see how well the ESPN fantasy football projections predict a win. The code is as follows:

# Create lists to track W/L metrics
teams_list_for_df = ['team_1', 'team_2', 'team_3', 'team_4', 'team_5', 'team_6', 'team_7', 'team_8', 'team_9', 'team_10']  # hardcoded to keep numeric order (np.unique would sort team_10 before team_2)
projected_wins_list_for_df = []
projected_losses_list_for_df = []
actual_wins_list_for_df = []
actual_losses_list_for_df = []
ats_wins_list_for_df = []
ats_losses_list_for_df = []
over_list_for_df = []
under_list_for_df = []
tp_list_for_df = []
fp_list_for_df = []
tn_list_for_df = []
fn_list_for_df = []

# Loop over all teams
for team in teams_list_for_df:
    # Get number of projected wins (np.unique sorts labels, so index 0 = 'L' and index 1 = 'W')
    projected_wins_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'Projected Result'], return_counts = True)[1][1])
    # Get number of projected losses
    projected_losses_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'Projected Result'], return_counts = True)[1][0])
    # Get number of actual wins
    actual_wins_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'Actual Result'], return_counts = True)[1][1])
    # Get number of actual losses
    actual_losses_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'Actual Result'], return_counts = True)[1][0])
    # Get number of ATS wins
    ats_wins_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'ATS Result'], return_counts = True)[1][1])
    # Get number of ATS losses
    ats_losses_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'ATS Result'], return_counts = True)[1][0])
    # Get number of games hitting the Over ('Over' sorts before 'Under')
    over_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'Game Over Under'], return_counts = True)[1][0])
    # Get number of games hitting the Under
    under_list_for_df.append(np.unique(fantasy_football_df.loc[fantasy_football_df['Team'] == team, 'Game Over Under'], return_counts = True)[1][1])
    # Get number of projected Ws that are Ws (true positives)
    tp_list_for_df.append(len(fantasy_football_df.loc[((fantasy_football_df['Projected Result'] == fantasy_football_df['Actual Result']) & (fantasy_football_df['Team'] == team) & (fantasy_football_df['Projected Result'] == 'W')), 'Projected Result']))
    # Get number of projected Ws that are Ls (false positives)
    fp_list_for_df.append(len(fantasy_football_df.loc[((fantasy_football_df['Projected Result'] != fantasy_football_df['Actual Result']) & (fantasy_football_df['Team'] == team) & (fantasy_football_df['Projected Result'] == 'W')), 'Projected Result']))
    # Get number of projected Ls that are Ls (true negatives)
    tn_list_for_df.append(len(fantasy_football_df.loc[((fantasy_football_df['Projected Result'] == fantasy_football_df['Actual Result']) & (fantasy_football_df['Team'] == team) & (fantasy_football_df['Projected Result'] == 'L')), 'Projected Result']))
    # Get number of projected Ls that are Ws (false negatives)
    fn_list_for_df.append(len(fantasy_football_df.loc[((fantasy_football_df['Projected Result'] != fantasy_football_df['Actual Result']) & (fantasy_football_df['Team'] == team) & (fantasy_football_df['Projected Result'] == 'L')), 'Projected Result']))

# Create data frame
team_wins_df = pd.DataFrame(list(zip(teams_list_for_df, projected_wins_list_for_df, projected_losses_list_for_df,
                                     actual_wins_list_for_df, actual_losses_list_for_df, ats_wins_list_for_df,
                                     ats_losses_list_for_df, over_list_for_df, under_list_for_df,
                                     tp_list_for_df, fp_list_for_df, tn_list_for_df, fn_list_for_df)),
                            columns = ['Team', 'Projected Wins', 'Projected Losses', 'Actual Wins', 'Actual Losses',
                                       'ATS Wins', 'ATS Losses', 'Games Over', 'Games Under', 'TP', 'FP', 'TN', 'FN'])

# Calculate accuracy metrics from the confusion-matrix counts
team_wins_df['Accuracy'] = (team_wins_df['TP'] + team_wins_df['TN'])/(team_wins_df['TP'] + team_wins_df['TN'] + team_wins_df['FP'] + team_wins_df['FN'])
team_wins_df['Recall'] = (team_wins_df['TP'])/(team_wins_df['TP'] + team_wins_df['FN'])
team_wins_df['Specificity'] = (team_wins_df['TN'])/(team_wins_df['TN'] + team_wins_df['FP'])
team_wins_df['Precision'] = (team_wins_df['TP'])/(team_wins_df['TP'] + team_wins_df['FP'])
team_wins_df['F Score'] = (2*team_wins_df['TP'])/(2*team_wins_df['TP'] + team_wins_df['FP'] + team_wins_df['FN'])

We can output the resulting data frame to see the data from the entire league at once.

[Figure: team_wins_df showing win/loss counts and classification metrics for each team]
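As a cross-check on the hand-rolled counts, scikit-learn can compute the same kinds of metrics league-wide directly from the label columns (a sketch, assuming scikit-learn is installed; per-team numbers would require filtering to one team first):

from sklearn.metrics import classification_report

# Treat 'W' as the positive class, matching the convention above
print(classification_report(fantasy_football_df['Actual Result'],
                            fantasy_football_df['Projected Result'],
                            labels=['W', 'L']))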

Alternatively, we can create a simple scatter plot for easier visualization:

import plotly.express as px
import plotly.graph_objects as go
# Create figure
fig = go.Figure()
# Add scatter plots to see how teams compare on different metrics
fig.add_trace(go.Scatter(x=team_wins_df['Team'], y=team_wins_df['Accuracy'], mode='markers', name='Accuracy',
                         marker=dict(size=8, color='black')))
fig.add_trace(go.Scatter(x=team_wins_df['Team'], y=team_wins_df['Recall'], mode='markers', name='Recall',
                         marker=dict(size=8, color='blue')))
fig.add_trace(go.Scatter(x=team_wins_df['Team'], y=team_wins_df['Specificity'], mode='markers', name='Specificity',
                         marker=dict(size=8, color='red')))
fig.add_trace(go.Scatter(x=team_wins_df['Team'], y=team_wins_df['Precision'], mode='markers', name='Precision',
                         marker=dict(size=8, color='green')))
fig.add_trace(go.Scatter(x=team_wins_df['Team'], y=team_wins_df['F Score'], mode='markers', name='F Score',
                         marker=dict(size=8, color='orange')))
# Reference line at 0.5 (coin-flip accuracy)
fig.add_trace(go.Scatter(x=team_wins_df['Team'], y=np.zeros(len(team_wins_df['Team'])) + 0.5, mode='lines',
                         line=dict(color="black"), showlegend=False))

fig.show()

[Figure: scatter plot of accuracy, recall, specificity, precision, and F score by team]

In general, we can see that the metrics vary throughout the league. For all but team_3 and team_10, however, accuracy was better than 50% and therefore better than a simple coin flip. This provides evidence of the utility of the ESPN projected scores in predicting fantasy game outcomes. The projections perform worst with respect to specificity, meaning that a predicted loss was the least reliable call for most teams.

Limitations

This analysis was restricted in scope to the data I collected from my 10-team fantasy football league. I suspect that over the millions of fantasy football teams in the ESPN fantasy universe, the projected score performs even better as a regression model. However, the accuracy as a binary classification model was not quite as good. It largely performed better than a coin flip across multiple metrics, but it is not reliable enough to go into the week feeling comfortable about a projected win. In fairness to the ESPN model, the app outputs a probability of winning rather than a binary prediction, and the win probabilities for two opposing teams are usually fairly close, often in the 45-55% range.

Conclusions

The data seems to suggest that the ESPN fantasy football projected scores are reliable to within roughly 15-20%, and the W/L prediction is accurate more often than not. That being the case, I plan to continue my strategy of maximizing my projected score. It might not always work out for me, but in expectation, it is a great indicator.

All the code and data can be found on GitHub here: gspmalloy/fantasy_football (github.com)

Follow me on Twitter: @malloy_giovanni

I would love to hear from other League Managers. Is this something you plan to implement? Please comment below with your thoughts or experiences!
