March Madness Machine Learning, the State of BYU Basketball and 2011 — long
Hey Cougarboard, I’m a BYU student studying economics and math who wants to go into data science and machine learning. I completed a project this week that I think might be interesting given the recent Tournament, the state of BYU basketball and, personally, my belief that Jimmer was destined for the Final Four but sadly never got there.
So here we go. I will try not to be overly technical while still being detailed in the analysis. Gosh, I feel like my dad posting on this thing lol (and if this is too long and boring, believe me, I get it).
THE PROBLEM STATEMENT: Can we teach the computer to predict upsets in March Madness?
THE ANSWER: Yes. I believe that statistical analysis can provide deep insight into real Tournament success.
THE DATA: Our premise was to use regular season data to predict postseason outcomes. We used a team’s seed in that year’s Tournament, along with that team’s historical seeding, general conference rankings (was a team P5/6, depending on the year, or outside that group) and season averages for several different statistical categories to predict the outcomes of games. This regular season data required some serious manipulation. Specifically, we had game-by-game data which gave us the points, rebounds, assists, fouls, turnovers, blocks, etc. for both the winning team and the losing team. So we had to take the average of all of these columns based on who won and lost, then match those averages with their TeamID in the Tournament and weight their averages based on their records... It was quite a lot of work, and we had to create basically every feature we fed into our model.
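For anyone curious what that reshaping looks like, here is a minimal sketch in Python. The column names (w_team, l_pts, and so on) and all the numbers are made up for illustration, and it leaves out the record-based weighting, so treat it as the general idea and not our actual pipeline.

```python
from collections import defaultdict

# Hypothetical game-by-game rows: one row per game, with separate stat
# columns for the winner (w_*) and the loser (l_*). Values are made up.
games = [
    {"w_team": "A", "l_team": "B", "w_pts": 80, "l_pts": 70, "w_to": 10, "l_to": 14},
    {"w_team": "B", "l_team": "C", "w_pts": 65, "l_pts": 60, "w_to": 12, "l_to": 11},
    {"w_team": "A", "l_team": "C", "w_pts": 90, "l_pts": 75, "w_to": 9, "l_to": 15},
]

def season_averages(games, stats=("pts", "to")):
    """Collapse winner/loser columns into one per-team average per stat."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for g in games:
        # Visit each game twice: once from the winner's side, once from the loser's.
        for side, other in (("w", "l"), ("l", "w")):
            team = g[f"{side}_team"]
            counts[team] += 1
            for s in stats:
                totals[team][s] += g[f"{side}_{s}"]
                # What the opponent did is this team's "allowed" number.
                totals[team][f"opp_{s}"] += g[f"{other}_{s}"]
    return {
        team: {k: v / counts[team] for k, v in totals[team].items()}
        for team in counts
    }

averages = season_averages(games)
```

Here averages["A"]["opp_pts"] is the average points team A gave up, which is the kind of defensive feature we ended up caring about.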
THE MODEL: We ran three models, all of which utilized decision trees. Briefly, a decision tree, akin to a flow chart, is a way to keep splitting the data into smaller subsets until each subset is clean enough to pick an outcome. Because the favorite wins about 70% of the time, we thought about synthetically creating extra upsets so the machine wouldn’t just pick the favorite. This model wasn’t very interesting because it would only pick a 9 over an 8 every once in a while.
We had another model that would penalize itself if it got an upset incorrect (this is called Gradient Boosting), but this model was too risk averse, literally only picking the favorite.
The model that we went with was the simplest: a Random Forest, which is where you create lots of trees and pick whichever outcome gets the most votes. We would train this model on 10 random seasons from 2003 to 2018 and then validate on 5, continuing to adjust features until we got a result we really liked. We then ran it on the 2019 data.
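To give a feel for the "extra upsets" idea, here is a rough sketch of the simplest version: randomly duplicating upset rows in the training data until the two classes are the same size. This is plain random oversampling, which is a simpler stand-in for truly synthetic approaches (something like SMOTE would generate new in-between rows instead of copies). The function name, the "upset" label and the rows are all hypothetical.

```python
import random

def oversample_upsets(rows, label_key="upset", seed=0):
    """Duplicate randomly chosen upset rows until both classes are the same size."""
    rng = random.Random(seed)
    upsets = [r for r in rows if r[label_key] == 1]
    chalk = [r for r in rows if r[label_key] == 0]
    # Nothing to do if there are no upsets to copy or upsets are not the minority.
    if not upsets or len(upsets) >= len(chalk):
        return list(rows)
    extra = [rng.choice(upsets) for _ in range(len(chalk) - len(upsets))]
    return list(rows) + extra

# Made-up training rows: 7 games where the favorite won, 3 upsets.
rows = [
    {"seed_diff": d, "upset": u}
    for d, u in [(1, 1), (3, 0), (5, 0), (7, 1), (9, 0),
                 (11, 0), (13, 0), (2, 1), (4, 0), (15, 0)]
]
balanced = oversample_upsets(rows)
```

This should only ever be done to the training seasons; the validation seasons need to keep the real-world mix of favorites and upsets.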
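Two pieces of that are easy to show in a few lines: picking the seasons, and the voting. The sketch below assumes the split is a random draw of 10 training seasons and 5 different validation seasons out of the 16 from 2003 to 2018 (which leaves one season unused in any given draw), and that each tree simply casts one vote. The function names are made up for illustration.

```python
import random
from collections import Counter

def split_seasons(first=2003, last=2018, n_train=10, n_valid=5, seed=0):
    """Randomly pick disjoint sets of training and validation seasons."""
    seasons = list(range(first, last + 1))
    rng = random.Random(seed)
    rng.shuffle(seasons)
    train = sorted(seasons[:n_train])
    valid = sorted(seasons[n_train:n_train + n_valid])
    return train, valid

def majority_vote(tree_predictions):
    """Return the outcome most trees picked (ties go to the one seen first)."""
    return Counter(tree_predictions).most_common(1)[0][0]

train_seasons, valid_seasons = split_seasons()

# Five hypothetical trees voting on one game.
pick = majority_vote(["favorite", "upset", "favorite", "favorite", "upset"])
```

Splitting by whole seasons instead of by individual games keeps games from the same Tournament from landing on both sides of the split.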
RESULTS: The results of our modeled bracket for the 2019 Tournament are in the slideshow attached below. Our model predicted 13/16 Sweet 16 teams, 6/8 Elite 8 teams and 3/4 Final Four teams. It had both Texas Tech and Virginia in the championship game, but predicted Texas Tech to win it all. It is worth noting that the championship game, New Mexico State vs Auburn, and St Mary’s vs Villanova were all picked as upsets but were incorrect. However, each of these games could have gone either way: the championship game went to overtime, St Mary’s lost by 4, and NMST lost by 1 and had two chances to win the game. So the games we got wrong were, in large part, very tight.
Our model would have earned 1290 points on ESPN, good for the 98th percentile. If we had gotten the championship game right, we would have been in the 99.7th percentile, and if NMST had won we would have been at 99.9 or higher.
TAKEAWAYS: We wanted to see which features (non machine learning people would call them statistics) of our model the computer said were the most important in predicting upsets in the Tournament, that is to say, which variables it relied on the most when deciding which teams would advance in any given game.
What Mattered: Our model weighted seeding very heavily, both the seed given from 1-16 and the power seed (teams ranked 1-64). Given that most of the games are played in the first round, where seeding matters a lot more, our model relied on seeding most heavily during the first round. The features that mattered most in the later rounds were actually really interesting.
The next most important group of factors in determining the success of any given team in the Tournament were (a) average points given up by a team in the regular season, (b) total number of defensive rebounds, and (c) number of turnovers forced. This predictive power (which is different than correlation) between winning in March and fundamental defensive strength was a bit of a sanity check, seeing as both Texas Tech and Virginia had top defenses this year. It also provides insight as to why BYU has failed to qualify recently and has traditionally fared very poorly in the Tournament since 2003, with only 3 victories.
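If you are wondering how you even ask a model what it relied on, one generic way is permutation importance: scramble one column at a time and see how much the accuracy drops. The sketch below is an illustration of that idea only, with a toy stand-in for the model and made-up numbers; the 68-point cutoff and the two columns are hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average drop in accuracy when each column is shuffled on its own."""
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column scrambled.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(base - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy stand-in model: predicts a win if the team allows under 68 points a game.
# Column 0 = points allowed, column 1 = points scored (ignored by the toy model).
def toy_predict(row):
    return 1 if row[0] < 68 else 0

X = [[60, 75], [62, 70], [65, 82], [66, 68], [70, 80], [72, 71], [74, 85], [76, 66]]
y = [toy_predict(row) for row in X]  # labels the toy model gets 100% right
importances = permutation_importance(toy_predict, X, y)
```

Scrambling points allowed hurts the toy model, while scrambling points scored changes nothing because the model never looks at it, so its importance comes out as exactly zero.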
To further expand on this point, BYU has been abysmal on the defensive end of the court in recent memory. This year BYU was 169th in opponents’ field goal percentage and 139th in creating turnovers, though surprisingly they were in the top 70 in defensive rebounds (which was one of the features we found to be among the more important predictors of Tournament success). By comparison, UVU was 108th in defensive field goal percentage, 285th in creating turnovers, and 69th in defensive rebounding. Honestly, none of these numbers from Coach Pope’s team stand out to me. His team was below average on defense in a below average conference. It is hard for me to get on the Pope train because I don’t see him succeeding at BYU without an absolute emphasis on, and ability to teach and implement, fundamentally successful defense, which he did not come close to doing at UVU.
What Didn’t Matter: While we found that seeding and defensive features really mattered, features that were generally not predictive of Tournament success included historical seeding, P5/6 affiliation (except to the extent that it may influence seeding) and all offensive categories (total points scored, 3-point shooting percentage and free throw attempts, to name a few). So you can draw your own conclusions here, but our model would not be favorable to any coach who believes his team can simply outscore its opponents and see success in the Tournament.
WHAT ABOUT 2011: When BYU had Jimmer his senior year, they were 4th in the country in defensive rebounds, 36th in opponents’ turnovers (thank you Jackson Emery) and 99th in opponents’ field goal percentage. Although I did not have all the data needed to calculate defensive efficiency, our best estimate puts BYU at 32nd in defensive efficiency that year, as compared to 217th this year, while UVU was 132nd.
Speaking of that 2011 BYU team, I have said my whole life that the 2011 team with Jimmer Fredette and Brandon Davies would have made the Final Four, but I have never had a way to test this theory. So we ran our model on the 2011 Tournament with BYU as the 3 seed in the Southeast region. You may remember that BYU lost to Florida in the Sweet 16. However, our model predicted that a full strength BYU would beat Florida to advance to the Elite 8 and then beat Butler to make the Final Four. In the semifinal game our model predicted a win against VCU, putting BYU into the finals, where our model predicted they would have lost to UConn and Kemba Walker. So that was a really fun thing to run, and I now have data to back up my belief that the 2011 BYU team was a special one.
A small aside: My dad has said for a while that the 2004 BYU team that got a 12 seed was underseeded and should have beaten Syracuse if it had not been for some kid having the game of his life. According to our model BYU should have lost that game. Sorry, Dad.
Future Improvements on the Model: We would like to include points per possession because we have play-by-play data from every regular season game since 2003, but I would need to learn how to do Natural Language Processing, which remains a work in progress. Additionally, we could include a variable for whether the team has a projected top 10 NBA draft pick, and use the PageRank algorithm from Google to see how the current winning / losing streak affects a team heading into the Tournament. We could also look into clustering teams who made the Sweet 16 to see what those teams have in common.
When I was describing this project to one of my math professors, she mentioned that I could use PageRank and configure it as a feature. What it does is say BYU beat x team, x team beat y team and y team beat z team, so it has a way of ranking wins. When it does this it creates a matrix that has all of the teams as rows and columns and a 1 if they played, so the matrix gets very big very fast. In order to do the algorithm, you have to normalize your eigenvectors and then do some more linear algebra with those. It wasn’t too difficult to write, but I only had data since 2010, so I only did it for 2011. UConn had the highest PageRank, Kentucky (who made the Final Four) was #3 and BYU was #6. So the other thing that will probably need to improve is BYU’s scheduling.
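For the curious, the core of PageRank fits in a short Python sketch. This version uses power iteration (repeatedly passing rank along until it settles, which lands on the same dominant eigenvector) instead of solving for eigenvectors directly. It assumes each loss works like a link from the loser to the winner, and it uses the usual 0.85 damping factor; the teams and results are made up, and this is not the exact code from the project.

```python
def pagerank(teams, results, damping=0.85, iterations=100):
    """Power iteration on a win graph; results is a list of (winner, loser)."""
    n = len(teams)
    idx = {t: i for i, t in enumerate(teams)}
    # beaten_by[j] lists the teams that beat team j: the loser "links" to the winner.
    beaten_by = [[] for _ in range(n)]
    for winner, loser in results:
        beaten_by[idx[loser]].append(idx[winner])
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for j in range(n):
            if beaten_by[j]:
                # Split team j's rank evenly among the teams that beat it.
                share = damping * rank[j] / len(beaten_by[j])
                for i in beaten_by[j]:
                    new[i] += share
            else:
                # Unbeaten team: spread its rank evenly so the total stays 1.
                for i in range(n):
                    new[i] += damping * rank[j] / n
        rank = new
    return dict(zip(teams, rank))

# Made-up results as (winner, loser).
results = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
ranks = pagerank(["A", "B", "C", "D"], results)
best = max(ranks, key=ranks.get)
```

In this toy example the unbeaten team A comes out on top, and beating a team that itself has good wins is worth more than beating a team with none, which is why scheduling matters so much.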
That was my project. Here is the slide show we used when we presented.