Data Science Project Reflection: What I Learned From the 2021 NFL Big Data Bowl 🏈
Hello :)! I hope this is an informative and fun reflection to read if you are at all interested in Sports Analytics, Data Science, Machine Learning (ML), Python Programming, & AI in Sports. Though a bit long, I keep it organized and link a lot of sources if you are interested in learning more.
Intro. & I Was Pumped for Super Bowl Sunday
Writing this reflection in the spirit of Super Bowl LV (Big Andy Reid Fan), I took time to reflect on what I learned doing the NFL Big Data Bowl 2021 Analytics Competition. My submission is named Defensive Back Domination and has a summary of my outcomes at the top. Though I did not win (I did not expect to), I had fun, practiced data science, and learned a lot from the winning submissions, which thoroughly impressed me.
This year’s challenge was to use data to help evaluate defensive performance on pass plays, which was definitely challenging to tackle yet important since metrics for measuring NFL defenses lag behind those for offenses. Check out Kaggle for full competition details and NextGenStats, which provides the position and speed of every player on the field during each play. My two goals for this reflection are:
1. Document my technical process and aspects to learn from and improve:
- Data Preparation & Exploratory Data Analysis
- Feature Engineering & Visualizations using Matplotlib
- ML Model: XGBoost Decision Trees to Classify Pass Result
- Presenting the Results
2. Spark discussions about the evolving applications of AI & Analytics in the NFL: Game Strategy, Player Health & Safety, and Sports Betting.
In addition to following an iterative Data Science Process and coding in Python, I used the libraries numpy for computation, pandas to manipulate data as DataFrames, matplotlib to visualize player tracking data, scikit-learn for data splitting and evaluation utilities, and the XGBoost library to implement gradient boosted decision trees. See my project’s full code here. If I had more time, there are many concepts I would love to continue learning about and putting to use.
The Big Data Bowl engages the analytics community to help innovate the way football is played and coached. I was excited to contribute to this effort!
Data Preparation & Exploratory Analysis
Preparing the data is the important first step in order to efficiently and accurately analyze it. I did most of the work in the DataPrep Python file. Luckily, the datasets for this competition were pretty clean, and the .csv files were linked by columns with numerical IDs for game, play, and player. I also did some Exploratory Data Analysis in this step. Here is a summary of what I did:
- Read in raw .csv data files using the pandas library.
- Converted categorical columns (e.g., passResult, position) to numerical ones for use in exploratory data analysis and ML models.
- Pickled the data to make its size more manageable and saved the cleaned data as new .csv files.
- Joined pickled tracking data with player, play, and game data into one file in order to analyze relationships between variables.
- Conducted Exploratory Data Analysis and made a variety of charts such as scatter plots, histograms, and heat maps to understand the content of the 2018 Regular Season NFL Data.
- Normalized the data using two transforms: (1) Normalized tracking data to line of scrimmage x = 0. (2) Normalized defender and football position in tracking data to target receiver at (x,y) = (0,0). Normalizing the data helps ensure that values of numeric columns are on a common scale in case any features have different ranges. Among other things I looked into to prepare data for future projects, I read about impacts of data rescaling, standardization, and normalization on ML.
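As a rough illustration of the prep steps above, here is a minimal sketch using toy stand-in DataFrames. The real competition files are read with pd.read_csv; column names like passResult and absoluteYardlineNumber follow the Big Data Bowl schema, but the values below are made up:

```python
import pandas as pd

# Toy stand-ins for the competition's .csv files (real data comes from pd.read_csv).
plays = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [10, 20, 10],
    "passResult": ["C", "I", "C"],           # categorical pass outcome
    "absoluteYardlineNumber": [35, 60, 48],  # line of scrimmage
})
tracking = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [10, 20, 10],
    "x": [37.5, 58.0, 50.2],                 # raw field position
})

# (1) Encode the categorical pass result as a numeric column for EDA/ML.
plays["passResultNum"] = plays["passResult"].map({"C": 1, "I": 0})

# (2) Join tracking data with play data on the shared numeric IDs.
merged = tracking.merge(plays, on=["gameId", "playId"], how="left")

# (3) Normalize tracking x so the line of scrimmage sits at x = 0.
merged["x_norm"] = merged["x"] - merged["absoluteYardlineNumber"]

print(merged[["gameId", "playId", "passResultNum", "x_norm"]])
```

The same pattern (encode, merge on IDs, subtract a reference point) extends to the receiver-centered normalization.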
Feature Engineering & Graphical Visualizations using Matplotlib
Features are inputs to ML algorithms and are usually structured columns in the data. Therefore, I engineered features that characterize the defense and appended these as columns in my code to help improve the performance of the ML model designed to classify pass result using features characterizing the defense. I also made simple 2D graphical data visualizations in Python’s matplotlib library to help see relationships between variables. Here is what I engineered:
- Target Receiver of Pass: Found by parsing the playDescription field, which takes the form “(Shotgun) N.Foles pass short right to Z.Ertz to PHI 39 for 2 yards (L.Joyner).”, identifying Zach Ertz as the target receiver and Lamarcus Joyner as his defender.
- Euclidean Distances: the (x, y) distances between the target receiver, the football, and every defender on the field at every point in time.
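The two engineered features above can be sketched like this; note the regex pattern is a simplified assumption that only handles the common playDescription shape shown, not every variant in the real data:

```python
import re
import numpy as np

def parse_target_and_defender(play_description):
    """Pull the targeted receiver and nearest defender out of a playDescription.

    Assumes the common "pass ... to <Receiver> ... (<Defender>)" shape; real
    descriptions have more variants, so this is a simplified sketch."""
    match = re.search(r"pass .*? to ([A-Z]\.\w+).*\(([A-Z]\.\w+)\)", play_description)
    return match.groups() if match else (None, None)

desc = "(Shotgun) N.Foles pass short right to Z.Ertz to PHI 39 for 2 yards (L.Joyner)."
receiver, defender = parse_target_and_defender(desc)
print(receiver, defender)  # Z.Ertz L.Joyner

def euclidean_distance(p1, p2):
    """Euclidean (x, y) distance between two points on the field."""
    return float(np.hypot(p1[0] - p2[0], p1[1] - p2[1]))

# e.g. distance between a defender at (3, 4) and the normalized receiver at (0, 0)
print(euclidean_distance((3.0, 4.0), (0.0, 0.0)))  # 5.0
```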
I wrote a function that creates 3 subplots of the positions of defenders relative to the normalized target receiver using Hexagonal Bin Plots. The blue dot is the target receiver. Darker red areas represent a greater density of defenders within that region around the target receiver at the play event pass_arrived, meaning that the football has just arrived at the target receiver. In the first three plots, the play result is complete [C], while in the next three, it is incomplete [I].
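Here is a minimal hexbin sketch along these lines, using simulated defender positions rather than the real tracking data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical defender positions relative to the target receiver at (0, 0).
dx = rng.normal(loc=-2.0, scale=3.0, size=500)   # mostly behind the receiver
dy = rng.normal(loc=0.0, scale=2.0, size=500)

fig, ax = plt.subplots(figsize=(5, 4))
hb = ax.hexbin(dx, dy, gridsize=15, cmap="Reds")  # darker = more defenders
ax.plot(0, 0, "bo", markersize=10, label="target receiver")
ax.set_xlabel("x relative to receiver (yd)")
ax.set_ylabel("y relative to receiver (yd)")
ax.legend()
fig.colorbar(hb, ax=ax, label="defender count")
fig.savefig("defender_density.png")
```

The real version repeats this per subplot, filtering defender positions by play result.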
What we can see comparing the figures is that for completions, more defenders tend to be behind the target receiver, relative to the direction of the ball, than for incompletions. This makes sense since receivers are more likely to catch the ball when they are open and there is no defender between them and the quarterback (QB).
Additionally, to better visualize separation between the closest defender, the target receiver, and the football 🏈, I plotted their positions over time and connected them with solid lines to form a triangle that changes size over the course of the play. Blue is the football, green is the target receiver, and red is the closest defender. Their initial positions are at the normalized line of scrimmage x = 0.
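A rough sketch of this triangle plot, with hypothetical hand-made tracks rather than the real tracking data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical per-frame (x, y) tracks, normalized so the line of scrimmage is x = 0.
frames = np.arange(5)
football = np.column_stack([frames * 2.0, np.zeros(5)])         # blue
receiver = np.column_stack([frames * 2.0 + 1.0, frames * 0.5])  # green
defender = np.column_stack([frames * 1.8 + 0.5, frames * 0.6])  # red

fig, ax = plt.subplots(figsize=(6, 4))
for track, color, label in [(football, "blue", "football"),
                            (receiver, "green", "target receiver"),
                            (defender, "red", "closest defender")]:
    ax.plot(track[:, 0], track[:, 1], color=color, marker="o", label=label)

# Connect the three positions at each frame to draw the separation "triangle".
for i in frames:
    tri = np.vstack([football[i], receiver[i], defender[i], football[i]])
    ax.plot(tri[:, 0], tri[:, 1], color="gray", linewidth=0.8)

ax.set_xlabel("x from line of scrimmage (yd)")
ax.set_ylabel("y (yd)")
ax.legend()
fig.savefig("separation_triangle.png")
```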
Matplotlib is a powerful data visualization library in Python, one of the main programming languages for data science. For me, getting good at making 2D graphs in it took a bunch of practice, and I still have lots of room to grow in areas such as customization and style selections.
Getting Started with Matplotlib & Pandas
Knowing how to manipulate data, create beautiful graphs and visualizations, and tell stories with data is important in so many industries (data/sports analytics, finance, sales, digital media/marketing, you name it) to help us make data-driven decisions and get ideas across. Therefore, learning to code in a language such as Python and trying out the pandas and matplotlib libraries is a good start. These are useful technical skills to have, as computer science skills are in demand in a variety of industries.
Try Exploring Other Data Visualization Libraries
Matplotlib is great for exploratory data analysis and creating 2D plots but falls a little short for formal presentations. Alternatives to matplotlib include other libraries in Python and R such as Plotly, ggplot2, and Bokeh for more interactive visualizations. This article is a good start for reading about the pros and cons of data visualization libraries.
ML Model: XGBoost Trees to Classify Pass Result
Artificial Intelligence (AI) VS Machine Learning (ML)
This is one of the longer sections I reflect on, but informative if you are interested in learning more about AI and ML algorithms to apply AI. Check out the Emerging Tech Brew’s Guide on AI for one of the best explanations I’ve seen. It is no secret that AI is booming in a variety of sectors. According to PWC’s 2020 AI Report, COVID-19 has accelerated investment in AI for 52% of surveyed businesses.
ML, a subset of AI, employs algorithms that enable systems to find patterns in big data to make predictions and draw conclusions. Types of ML include deep learning, supervised/unsupervised learning, and reinforcement learning. ML algorithms/models within these types include Neural Networks, Decision Trees, Naive Bayes, and Q-Learning, just to name a few. Each is better fit for certain types of problems. I only touched on a few ML algorithms here, but I considered each of them and their use cases for my project.
Additionally, to learn about advances of ML technology in American football, I did some research. For instance, these articles explore ML to identify Man VS Zone defensive coverage and model QB decision-making, respectively.
- Dutta, R., Ventura, S., Yurko, R. 26 June 2019. “Unsupervised Methods for Identifying Pass Coverage Among Defensive Backs with NFL Player Tracking Data.” Journal of Quantitative Analysis in Sports 16.2 (2020): 143–161.
- Cheong, L., Chi, M., Noori, M., Schaefer, M., Tyagi, A., Zeng, X. “Predicting Defender Trajectories in NFL’s Next Gen Stats.” AWS Machine Learning Blog.
Hardware Requirements & Computation Needed for ML Models
ML models require tons of calculations, and regular consumer hardware may not be able to perform them efficiently. Training is the most computationally intensive task. Because they have many more cores, GPUs offer far higher parallelism than CPUs. Therefore, the CPU vs. GPU faceoff is real.
For my ML model for the 2021 Big Data Bowl, I implemented Gradient Boosted Decision Trees. Since the NFL data was large yet manageable and the task was a bit more intensive, I decided to use a GPU instance on Amazon EC2 and ran my code in a Jupyter Notebook Server on AWS so that it would run more quickly. Other cloud services to consider are Google Cloud and Microsoft Azure. I look forward to gaining more exposure to different cloud services as I work on future projects in the realm of AI.
Choosing an ML Algorithm From So Many: Why I Chose XGBoost Decision Trees For the NFL Big Data Bowl
In order to decide which ML algorithm to use and how to implement it with high precision and accuracy, I first defined my problem.
Problem: Classify pass play result as complete or incomplete given training data that characterizes defensive backs and their relationships to the football and the target receiver in-play.
This is a simple classification problem with a binary outcome (complete or incomplete). My input features to the model are defense-centric since this year’s challenge is to use data to help evaluate defensive performance on pass plays.
In order to select the proper ML Classification Algorithm, I did some reading and decided to implement gradient boosted decision trees using XGBoost (Chen and Guestrin 2016), an algorithm I had heard of for its success in Kaggle competitions. The purpose of gradient boosting is to convert weak learning algorithms such as decision trees into strong ones by optimizing a loss function, a measure of how well the model’s coefficients fit the underlying data. XGBoost is popular for classification, and advantages of using it include its sparsity awareness, built-in cross-validation to avoid overfitting, and many others described here.
XGBoost Python Package Implementation: Summary of Steps
See my full code on GitHub here.
1. Installed XGBoost.
2. Input data and features for the model as a pandas DataFrame.
3. Set test data size to 30% and training to 70%.
4. Loaded data into a DMatrix.
5. Set booster parameters: max depth, eta, objective, number of classes.
6. Set the number of training iterations.
7. Trained the model.
8. Classified pass result and made predictions from training and test data to output precision, recall, and accuracy.
Predictions from Training Data: Precision 96%; Recall 74%; Accuracy 90%
Predictions from Test Data: Precision 46%; Recall 39%; Accuracy 79%
So… how did my XGBoost implementation do? Even though the model had an accuracy of 79% to classify pass result on test data (which doesn’t seem that awful), the model clearly overfit the training data. This means that it modeled the training data so well that performance on the test data was negatively impacted. See below for how I would address overfitting next time. It’s okay, I learned from this :).
XGBoost Improvements For Next Time:
There are a bunch, and I touch on them briefly here:
- Minimize Overfitting to more accurately classify test data and pass results. One way is to utilize XGBoost’s built-in early stopping, which works by finding the inflection point where performance on test data starts to decrease while performance on the training data keeps improving as the model begins to overfit.
- Utilize k-Fold Cross-Validation, supported in scikit-learn, to measure the performance of XGBoost. It splits the data into k parts, yielding k different performance scores that can be summarized with a mean and standard deviation of classification accuracy.
- Utilize ML Evaluation Metrics for Classification Problems such as the Confusion Matrix, F1 Score, AUC-ROC Curve, and Gini Coefficient. For instance, in AUC-ROC Curve, AUC = Area Under the Curve. In context of my project, a higher AUC says that the model can distinguish between completions and incompletions more accurately.
Presenting the Results
After implementing XGBoost and running out of time before the project was due (lol), I didn’t get to present my results in a visually appealing or interactive way to tell a more powerful story about the data. However, I explored powerful data visualization libraries and methods so that I have them in my toolbox for next time.
I was also inspired by some of the ways the winners of the 2021 Big Data Bowl presented their results visually. I’d explore them if you have time; they are so cool and well done.
Discussion: Evolving Applications of AI & Analytics in Sports
Just as I applied ML in my 2021 Big Data Bowl submission, engineers are applying ML to Next Gen Stats, aka player tracking data, in order to create next-level game-day stats. Some of the ones we see on prime time football include completion probability and, hopefully, new defensive stats generated from this year’s Big Data Bowl. What’s next? Here is the NFL/AWS AI Playbook, available for download if you want to keep up. AWS also powers other sports and leagues (NHL, NASCAR, and others).
Game Strategy for Coaches, Teams, & Players
How can coaches, teams, and players benefit from ML insights via Next Gen Stats while preparing for the next week or while in-game? Currently, the Seattle Seahawks are dubbed the most “tech-progressive and data-driven” team in the NFL for leveraging AWS cloud and ML services for operational efficiencies and strategic game insights. Through the use of a data lake that stores big data in a serverless architecture, the Seahawks hope to improve talent acquisition, player health & recovery, and game preparation. Other teams are moving in a similar direction as they figure out how to leverage data and AI for a competitive edge. I hope the Philadelphia Eagles are next!
Microsoft also has a strong partnership with the NFL and individual teams, providing Microsoft Surface tablets, Surface Sideline Viewing System (SVS) technology, and most recently Microsoft Teams in 2020. However, AWS is the NFL’s official cloud and AI provider. Microsoft, on the other hand, became the official AI and laptop provider of the NBA in April 2020 in a new multi-year deal, beating out AWS and Google. I wonder what kinds of insights teams will be able to access in real time directly from their Surface tablets thanks to ML advances, as well as utilize to prepare for games in years to come.
Player Health & Safety
In addition to using data and AI to enhance fan experience, team infrastructure, and game strategy, the NFL is also leveraging it to help keep players safe. For example, a data-driven kickoff rule change in 2018 resulted in a 38% reduction in kickoff concussions that season. Data from the 2015–17 seasons showed that players had about 4x the risk of concussion on kickoffs compared to pass or running plays.
These data-driven insights about concussions came about through another analytics competition the NFL sponsors, called NFL 1st and Future, designed to spur innovation in athlete safety and performance. This year’s data science/AI challenge was a computer vision task: create ways to detect on-field helmet impacts. See more information about the 2021 winning submissions here.
Sports Betting
This is a whole other topic, but an important one, as a record 7.6 million Americans were expected to bet online on Super Bowl LV according to the American Gaming Association. This is a 63% increase from last year, largely due to the increased adoption of legal betting in the United States. However, about an hour before kickoff, customers could not access their online sportsbooks because the servers of DraftKings and FanDuel could not handle the demand. It’ll be interesting to see the potential advances and uses of ML in the world of sports betting in years to come, and to consider the potential ethical implications.
My Final Reflection on this Reflection
I just covered a lot, ranging from my technical data science process to discussions about AI/ML in sports, but had so much fun learning through writing. I hope that you enjoyed reading!