I convinced a couple friends to join me in a quest to wield Machine Learning to predict hockey player performance. It was a promising idea, but I needed their programming help and hockey insight to make it a reality. Knowing the outcome of the season before it’s played would assure success for our fantasy hockey teams! Just throw a bunch of stats into the black box of an ML algorithm and… poof! Championships! Right? We weren’t that naive (but still: naive). And at least we were aware of our naivety. Our combined experience with ML was 0.00, so if nothing else, we’d have infinitely more experience by the end.
Expectations were sane. We would generate and evaluate a prediction for a coming season’s performance based on the performance in the previous seasons. To do this, we needed:
- Data! NHL player statistics recorded by season.
- An ML algorithm that uses the player stats to predict future production.
- An approach to quantify the accuracy of the predicted production.
If our objective was to race a car, we needed an engine/drivetrain (algorithm), fuel (data), and a speedometer (quantification). For us, success would be defined by simply getting this beater around a track. No expectations of setting a new track record here. Not yet.
I took on the machine learning aspect of the project. Some online reading (including particularly useful articles by Jason Brownlee, such as this one) suggested that, as a multivariate time-series prediction problem, the Long Short-Term Memory (LSTM) algorithm would be well-suited for the task. The fact that LSTM is one of the algorithms available through the Python Keras library sealed the deal — LSTM would make our predictions.
A key aspect of LSTM that makes it attractive for this problem is that the algorithm can use multiple time steps as input for training and prediction. Previous performance indicates (to some extent) future production in the NHL, so we want to include as much information as possible.
For this project, we aimed to predict a player’s performance for a single statistical category for a single step forward in the series. I know that the problem could have been framed to predict multiple steps, and I suspect there was also a way to predict multiple stat categories in one go, but those weren’t in-scope for us.
Like other Recurrent Neural Networks, LSTM has a number of hyperparameters to set. These parameters control the rate at which the ML model training occurs and the quality of predictions that can be made by the model. I found some parameters that did a reasonable job, but they are by no means optimized. Tuning this ML algorithm wasn’t part of this project — remember — we’re not building an F1 racer! Just something to get us around the track.
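To make the moving parts concrete, here is a minimal sketch of the kind of Keras LSTM setup we used. The layer size, epoch count, and batch size below are illustrative placeholders, not our actual values, and the training data is random noise just to show the expected shapes.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

N_LAGS, N_FEATURES = 3, 9  # three lagged seasons, nine stats per season

model = Sequential([
    Input(shape=(N_LAGS, N_FEATURES)),  # (time steps, features)
    LSTM(50),   # number of hidden units: one of the hyperparameters to tune
    Dense(1),   # a single output: the predicted stat category
])
model.compile(optimizer="adam", loss="mse")  # learning rate left at Adam's default

# Dummy data just to show the 3D input shape: (samples, time steps, features)
X = np.random.rand(100, N_LAGS, N_FEATURES)
y = np.random.rand(100, 1)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```

Every number in that block (units, epochs, batch size) is a dial you can turn — which is exactly the tuning we deliberately left for later.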
When it comes to data, one does not simply load into LSTM. There is data, and then there is data. Our ML car needed something to power it. We had some crude oil, but it had to be refined before we could fuel up.
Neal Harder (first recruit) retrieved statistics from NHL.com using a scraping algorithm. Basically, he took data like this and placed it into a table in an SQLite3 database. It worked like magic, but I don’t doubt that there’s some trickiness here that I don’t appreciate. I’ve invited Neal to share his Python code, perhaps attached to a blog post (nudge, nudge, Neal).
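The storage side of the scrape looks roughly like this. The table name, columns, and schema here are my guesses at a simplified version of Neal's layout, not his actual code, and the sample row uses an in-memory database for illustration.

```python
import sqlite3

# Hypothetical schema -- Neal's real table has more columns than this.
conn = sqlite3.connect(":memory:")  # the real project writes to a file on disk
conn.execute("""
    CREATE TABLE IF NOT EXISTS skater_stats (
        player_id INTEGER, season INTEGER, games_played INTEGER,
        goals INTEGER, assists INTEGER, points INTEGER
    )
""")

# Rows scraped from NHL.com get inserted in bulk:
rows = [(8478402, 2016, 45, 16, 32, 48)]  # e.g. Connor McDavid, 2015-16
conn.executemany("INSERT INTO skater_stats VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()
```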
Knowing the type of engine we were running, we refined our data accordingly. The process of transforming the data into something to fuel the algorithm is not a trivial thing. Understanding what the algorithm required and how to generate it was the most ~~difficult~~ enlightening part of this project (for me, at least). Let’s have a look at what the refined product looks like.
The basic unit of our ML data is akin to one row from our table of raw stats (think of that table on NHL.com). The row records one player’s statistics for one year. The columns are any statistics that might be insightful for the statistic that is to be predicted. A row is often called a feature vector: a multi-dimensional numerical representation of a player’s performance. Selecting stats to include in the model is important, so I did some analysis by following a fantastic notebook from Matteo Niccoli. Though that analysis isn’t part of this blog post (or accompanying notebook), those insights have influenced the statistics included as ML input. For this analysis, the statistics used to predict points were: year, player ID, games played, goals, assists, points, power play points, shots, and time on ice per game.
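In code, one feature vector is just a fixed-order list of numbers — one player-season. The values below are illustrative stand-ins, not actual scraped stats:

```python
# One feature vector: a player-season, columns in a fixed order.
# Column names mirror the stat list above; values are made up for illustration.
columns = ["year", "player_id", "games_played", "goals", "assists",
           "points", "pp_points", "shots", "toi_per_game"]
row = [2016, 8478402, 45, 16, 32, 48, 14, 105, 18.9]

feature_vector = dict(zip(columns, row))
print(feature_vector["points"])  # 48
```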
In addition to the raw data, one other statistic was added to each feature vector: a lagged stat. This is a key process that frames the problem for the ML algorithm.
Imagine that we want the value q. We’re not sure what q is, but we think we can calculate it with a function F, and F depends on the variables x, y, and z. This can be written as:
q = F (x,y,z)
In this equation, x, y, and z are independent variables, and q is the responding variable. We can map this to our hockey forecasting problem, where the independent variables (x, y, z) are statistical results for a season, the responding variable (q) is the performance in a stat category for a future season, and F is the ML algorithm. In this case, the responding variable comes from a different time than the independent variables, i.e. there is a time gap, or lag between them.
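A toy example of the lagging step, sketched with a pandas `shift` (our actual pipeline pulls from SQLite instead, but the idea is the same): next season's points become the responding variable attached to this season's feature vector.

```python
import pandas as pd

# Three seasons for one (made-up) player.
df = pd.DataFrame({
    "player_id": [1, 1, 1],
    "season":    [2014, 2015, 2016],
    "points":    [60, 70, 80],
})

# Shift points back one season within each player:
# the target for season t is the points scored in season t+1.
df["target_points"] = df.groupby("player_id")["points"].shift(-1)
# 2014 gets target 70, 2015 gets target 80, and 2016 has no target (NaN)
```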
Remember that LSTM can handle multiple time steps? We constructed data sets with lags of one, two, and three years to represent the time steps to the algorithm. Each lag exists as its own table of rows and columns, and then these tables are “stacked” on each other to create a 3D array.
Building the 3D array was the biggest stumbling block I faced in this project. It took me a while to realize that I needed to maintain some connectivity between the layers of the array. Think about one layer of the array — a single lag. In such a table, every entry in a row belongs to the same player in the same year. The connection between these variables is implicit in their location on the table. The connection between lags must also be maintained: row i in a layer differs from row i in another layer by only the lag. Not the easiest concept to explain in words… I expect that if you’re truly interested, you’ve already found the accompanying notebook in the GitHub repo, so you can let the code explain it to you.
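For those who'd rather not dig through the notebook, here's the stacking idea in miniature, with random numbers standing in for real stats. The key is that row i means the same player in every layer:

```python
import numpy as np

n_players, n_features = 4, 9
lag1 = np.random.rand(n_players, n_features)  # stats from one season back
lag2 = np.random.rand(n_players, n_features)  # two seasons back
lag3 = np.random.rand(n_players, n_features)  # three seasons back

# Row i in every layer must belong to the same player. Stacking the
# lag tables along a new middle axis produces the
# (samples, time steps, features) shape that the LSTM expects.
X = np.stack([lag3, lag2, lag1], axis=1)
print(X.shape)  # (4, 3, 9)
```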
Ultimately, the data array is built by repeatedly pulling data from Neal’s SQLite database. I found this to be a much easier approach than shuffling data around a Pandas dataframe. Don’t get me wrong here, Pandas is great! It’s my Pandas skills that are lacking.
Quantifying forecast quality
Adam Gignac (second recruit) joined the project and took on the task of quantifying the performance of the forecasts. This was done a couple of ways.
We can quantify the quality of the forecast using the dataset itself. For example, use data from 2006-2010 to predict performance in the 2011-2012 season, then compare the prediction to the actual performance for that year. Calculating the Root Mean Square Error (RMSE) between the predicted and actual production provides a quantitative measure of quality. It’s a simple approach that’s simply applied. Not having coded anything since the series finale of Friends, Adam reacquainted himself with the fun of scripting by creating his own RMSE function. It was a great exercise to learn Python syntax (and appreciate the convenience of NumPy!).
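Adam's actual function isn't reproduced here, but a NumPy sketch of the same quantity fits in a few lines:

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual values."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Made-up example: three players' predicted vs. actual point totals.
print(rmse([50, 60, 70], [48, 65, 72]))  # sqrt(11) ~ 3.32
```

Writing this by hand first, then replacing the loop with NumPy array operations, is exactly the kind of exercise that makes you appreciate the library.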
Quantification is great in that it provides a number, but it’s not really useful without some other numbers to compare with. How does our forecast compare to others? What’s the baseline? Is there a benchmark for “good” or “great”?
To define a baseline, we use a very simple method: future production will be the same as last season’s production. Connor McDavid scored 48 points in 2015-2016, so this basic method predicts he should score 48 points in 2016-2017 as well! We hope ML can do better than this.
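The baseline barely deserves code, which is rather the point:

```python
def baseline_forecast(last_season):
    """Naive baseline: predict every player repeats last season's total."""
    return dict(last_season)

prev = {"Connor McDavid": 48}      # 2015-16 points, from the example above
print(baseline_forecast(prev))     # {'Connor McDavid': 48}
```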
To define forecast benchmarks, we look to the professionals. Before every season, a number of hockey prognosticators publish their predictions for the coming year. I expect these incorporate some statistical analysis, but I wouldn’t be surprised if some were completely generated in the minds of well-informed experts. They must do a (passable) job, because those predictions are published year after year. Since they are the professionals, we use their predictions as benchmarks. Achieving results on par with these analysts was the stretch target for our venture. Would matching these benchmarks demonstrate how great ML is? Or how poor the professional analysts are? Or just how difficult it is to predict something like sports?
There is likely a way to determine a maximum achievable correlation coefficient as a function of the innate variability of the input data. This would be a ceiling on the accuracy of the predictions, and perhaps the ultimate prediction benchmark. That’s beyond the scope of this project, though.
Looking only at RMSE as a measure of prediction quality can be misleading, so we made scatter plots comparing predicted to actual values for the different predictions. A point on the 1:1 line is a perfect prediction: the predicted value is the same as the actual one. Points above the line are over-predicted: the predicted value is higher than actual. Points below the line are under-predicted.
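A minimal sketch of that kind of plot, using made-up point totals rather than our actual forecasts:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Illustrative values only -- not real forecast output.
actual = np.array([20, 35, 48, 60, 85])
predicted = np.array([25, 30, 45, 52, 70])

fig, ax = plt.subplots()
ax.scatter(actual, predicted)
lims = [0, 100]
ax.plot(lims, lims, "k--", label="1:1 (perfect prediction)")
ax.set_xlabel("Actual points")
ax.set_ylabel("Predicted points")
ax.legend()
fig.savefig("scatter.png")
```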
An important point to make about ML forecasts: results will vary. This is a feature, not a bug! Using the same data to train an ML model can produce different forecast results. Studying the range of results from multiple predictions can be insightful, but it’s out of scope for this project. I mention it now because the scatter plot shows one forecast set – not our best, but not our worst either. If you run the code for yourself, the results will be slightly different.
The scatter plots show that our ML approach creates a prediction! Values are within the expected range, and there’s even a positive correlation with actual values. By no means perfect: this race car we’ve built has poor brakes and it gets a little squirrelly at high speeds — but she rolls! Achievement unlocked!
The scatter plot shows the ML has some issues. In general, most performances are under-predicted, and there appears to be a systematic over-prediction for low performers. Quantitatively, the RMSE value for the ML prediction is better than the baseline, and on par with the prediction magazines. For mid-range performers, the ML predictions form the tightest scatter of any of the datasets (this is not obvious when the scatter plots are different sizes — sorry!).
Adam undertook a more in-depth analysis of the forecasts to understand where some excelled and others struggled. Segmenting the results into low-, mid-, and high-performers seemed particularly insightful, and would make an interesting blog post (nudge, nudge, Adam).
Upgrading our ride
It’s taken a lot of work to get this far, but there is much more to come to improve and tune our algorithm. Can the results be improved by including more lag steps? What about fewer lags? What other stats provide more insight? Age? Entry draft position? Team(s) played for? What if we predict goals and assists separately? How much improvement can we find by optimizing the hyperparameters or exploring alternative ML algorithms? The next phase of the project may depend on where we’re heading…
Now that we have the keys to the car, it’s time to get out there and hit the road. But where to? How can we apply this ML forecasting? Our work is easily extended to these:
- Go beyond deterministic forecasts to create probabilistic ones (floor – most likely – ceiling).
- Is your fantasy league concerned with stats other than points? Use ML to forecast hits, shots, save percentage, penalty minutes, etc.
I can’t wait to hit the road.
Crossley Chassis image by Rankin Kennedy C.E. [Public domain], via Wikimedia Commons.
“Achievement Unlocked” image generated at http://achievementgen.com/360/