Using Decision Trees to Predict Hall-of-Famers: Part Two
What makes a player great? Predicting Hall-of-Fame pitchers using machine learning.
In our previous article, we used decision trees to predict which eligible players will be inducted into the Hall of Fame in the future based on their batting statistics. This week, we'll do the same thing for pitchers. If you haven't read last week's article, do that first - it has all of the background information needed for this project.
The Problem
Using all of the data available about historical players, can we predict which eligible players will be inducted into the Hall of Fame in the future? In our list of eligible players, we'll include both players that have played within the past five years (and so do not meet the ballot's criteria) and players that have appeared on the ballot but haven't reached the 75% threshold for induction.
Pitching and batting have an interesting relationship in baseball. Except for the occasional Bartolo Colón home run or inning pitched by a position player, the two don't often overlap. To highlight the flexibility of these machine learning techniques, though, we can use decision trees to predict Hall-of-Famers on this dataset, too. In fact, we'll barely have to change any code.
The Data
For the most part, we can treat pitching data like we did our batting data. The Chadwick Bureau Database has all of the stats we'll need, including wins, losses, innings pitched, and earned runs for every pitcher who has stepped on the mound. We'll augment this data by calculating each pitcher's ERA and WHIP. We'll also include some informative stats like All-Star Game appearances, World Series MVP awards, and Cy Young Awards. There are two other important notes. First, we only include players who retired after 1970. This ensures that we're basing our predictions on the modern game, which is significantly different from early baseball. Second, we only consider players who made the majority of their appearances at the pitcher position. That keeps HOF-level batters who happened to make a pitching appearance out of our dataset.
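As a rough sketch of that augmentation step: the stat tables later in this article report innings as outs recorded (`IPouts`), so both rate stats can be derived from the counting columns. The function names here are illustrative and are not taken from the project code.

```python
def era(earned_runs: int, ip_outs: int) -> float:
    """Earned run average: earned runs allowed per nine innings (27 outs)."""
    return 27 * earned_runs / ip_outs


def whip(walks: int, hits: int, ip_outs: int) -> float:
    """Walks plus hits per inning pitched (3 outs per inning)."""
    return 3 * (walks + hits) / ip_outs


# Mickey Lolich's career line, copied from the results tables in this article.
lolich_era = era(earned_runs=1390, ip_outs=10915)
lolich_whip = whip(walks=1099, hits=3366, ip_outs=10915)
print(round(lolich_era, 3), round(lolich_whip, 3))  # 3.438 1.227
```

Both values agree with the ERA and WHIP columns shown for Lolich, which suggests the tables were built the same way.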
Last time, we saw how we can use AUC-ROC to better evaluate our model given the imbalanced classes. Because a similarly small fraction of pitchers makes the Hall of Fame, we'll use the same metric again.
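To make the choice of metric concrete, here is a minimal pure-Python sketch of AUC-ROC as the probability that a randomly chosen Hall-of-Famer gets a higher score than a randomly chosen non-inductee. The toy labels and scores are made up for illustration, and a real project would more likely call a library routine than use this quadratic loop.

```python
def auc_roc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up toy data: 2 Hall-of-Famers among 10 pitchers.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.0]

# A model that says "nobody gets in" is 80% accurate here but has no ranking skill.
always_no = [0.0] * len(labels)
accuracy_of_always_no = sum(1 for y in labels if y == 0) / len(labels)
print(accuracy_of_always_no, auc_roc(labels, always_no), auc_roc(labels, scores))
# 0.8 0.5 0.9375
```

Accuracy rewards the do-nothing model for the class imbalance alone, while AUC-ROC scores it at 0.5, the same as random guessing.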
Lastly, we have the same regularization techniques in place to prevent overfitting. Our minimum impurity decrease is the same, as is our maximum depth.
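For readers who want to see what those two settings look like in code, here is a hedged sketch that assumes scikit-learn's `DecisionTreeClassifier`; the article does not name the library, and the numeric values are placeholders rather than the ones used in this project. The training rows are synthetic.

```python
from sklearn.tree import DecisionTreeClassifier

# Synthetic rows of [all_star_games, strikeouts]; 1 = Hall of Fame. Not real players.
X = [[8, 3100], [7, 2500], [6, 2900], [2, 3000], [1, 900], [0, 400],
     [3, 1500], [5, 2000], [0, 1200], [4, 800], [1, 2100], [2, 600]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Placeholder values: the article says only that both settings match last week's.
clf = DecisionTreeClassifier(
    max_depth=3,                 # cap on the number of levels of splits
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```

Both parameters limit how finely the tree can carve up the training set, which matters when only a handful of rows belong to the positive class.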
Results
First, check out the code used in this article here.
This decision tree looks pretty similar to last week's tree. Interestingly, the root node here splits on All-Star Game appearances, just like last week's root node. Here, the threshold is less than or equal to 5.5 appearances.
If a player has more than 5.5 All-Star Game appearances, they are sorted into the right side of the tree. Already, the majority of these players are Hall-of-Famers, as can be seen from the 'value' field and blue color. The model separates the remaining players by games played (node #4) and losses (node #5). This gives us an idea of what the model is looking for: if a player has pitched a large number of games and compiled enough All-Star appearances, then they're pretty much a lock for the Hall of Fame. If they don't have quite as many appearances, then they need a high winning percentage (and a small number of losses).
On the left side of the tree, the model looks at the total number of career strikeouts. 2,793 strikeouts are required to be classified as a Hall-of-Famer without at least 6 All-Star Game appearances. This threshold is close to one of the major achievements in baseball: 3,000 career strikeouts. HOF voters recognize 3,000 strikeouts as a major career accomplishment, so it's possible our model is picking up on that voting tendency.
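The splits described so far can be written out by hand. This sketch covers only the thresholds quoted in the text; the right branch's games-played and losses cutoffs are not given, so the function returns `None` there. Treating the strikeout rule as "2,793 or more" is an assumption about how the boundary is handled.

```python
from typing import Optional


def top_of_tree(all_star_games: int, strikeouts: int) -> Optional[bool]:
    """Follow only the splits spelled out in the text.

    Returns True/False for the left branch, and None for the right branch,
    whose games-played and losses thresholds are not given in the article.
    """
    if all_star_games <= 5.5:        # root node
        return strikeouts >= 2793    # left branch; >= is an assumption
    return None                      # right branch: further splits not reproduced


# Jack Morris: 5 All-Star Games and 2,478 strikeouts (figures from this article).
print(top_of_tree(5, 2478))   # False
# A pitcher on the left branch with Mickey Lolich's 2,832 strikeouts
# (5 is a stand-in for "five or fewer" All-Star Games).
print(top_of_tree(5, 2832))   # True
# Catfish Hunter's 8 All-Star Games send him to the right branch.
print(top_of_tree(8, 2012))   # None
```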
The model achieves an AUC-ROC score of 98.02%, and it makes sense on an intuitive level, too. Let's dig into it a little further, though, and see where it makes mistakes. First, we can look at false positives - players that our model predicted would be in the Hall of Fame, but aren't:
| Name | W | L | G | GS | CG | SHO | SV | IPouts | H | ER | HR | BB | SO | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP | ERA | WHIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mickey Lolich | 217 | 191 | 586 | 496 | 195 | 41 | 11 | 10915 | 3366 | 1390 | 347 | 1099 | 2832 | 67 | 124 | 92 | 8 | 15140 | 40 | 1537 | 108 | 59 | 90 | 3.438 | 1.227 |
Only one player! That's pretty good. Mickey Lolich has a borderline case for the Hall of Fame. He put up 6 straight seasons of 200+ strikeouts from 1969 to 1974, including an impressive 308 K campaign in 1971. He even won a World Series MVP award in 1968, when he powered the Tigers in their comeback from a 3-1 deficit against the St. Louis Cardinals. He didn't amass many wins, though, going 217-191 in his career. Our model thinks he should be a Hall-of-Famer based on his career strikeout numbers (2,832, node #3), and I think this is a very reasonable mistake to make.
Next, let's look at false negatives:
| Name | W | L | G | GS | CG | SHO | SV | IPouts | H | ER | HR | BB | SO | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP | ERA | WHIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Catfish Hunter | 224 | 166 | 500 | 476 | 181 | 42 | 1 | 10348 | 2958 | 1248 | 374 | 954 | 2012 | 57 | 49 | 49 | 7 | 14032 | 6 | 1380 | 86 | 73 | 77 | 3.256 | 1.134 |
| Juan Marichal | 243 | 142 | 471 | 457 | 244 | 52 | 2 | 10522 | 3153 | 1126 | 320 | 709 | 2303 | 82 | 51 | 40 | 20 | 14236 | 11 | 1329 | 44 | 26 | 15 | 2.889 | 1.101 |
| Jack Morris | 254 | 186 | 549 | 527 | 175 | 28 | 0 | 11472 | 3567 | 1657 | 389 | 1390 | 2478 | 99 | 206 | 58 | 27 | 16120 | 10 | 1815 | 113 | 114 | 299 | 3.900 | 1.296 |
| Hoyt Wilhelm | 143 | 122 | 1070 | 52 | 20 | 5 | 227 | 6763 | 1757 | 632 | 150 | 778 | 1610 | 61 | 90 | 62 | 4 | 9164 | 651 | 773 | 4 | 7 | 0 | 2.523 | 1.125 |
Hoyt Wilhelm is an extremely interesting player. Before he made his Major League debut in 1952 at the age of 29, he served in World War II. Wilhelm was awarded a Purple Heart for injuries he sustained from a German artillery blast during the Battle of the Bulge. In his baseball career, Wilhelm pitched mostly out of the bullpen, throwing the knuckleball that he would become famous for. He made occasional starts throughout his career, and he even threw a no-hitter in 1958. He pitched until he was 49 in 1972, a feat unthinkable by today's standards. Because our model is trained primarily on starting pitchers, Wilhelm is at a disadvantage - he simply doesn't have the counting stats that starters do. Because of his longevity and success as a reliever, he deserves a place in the Hall of Fame.
The other three players all have pretty solid Hall of Fame cases, so I'm inclined to blame the mistakes on the limitations of the model (like our depth limit and minimum impurity decrease for splits) rather than the HOF voters. Catfish Hunter won 5 World Series titles and a Cy Young Award, appeared in 8 All-Star Games, and pitched a perfect game over the course of his relatively short career. Our model doesn't think he pitched enough games or has a high enough winning percentage to make the Hall, but I think that's partially because of his early retirement. He's a surefire HOFer. Juan Marichal and Jack Morris each have impressive resumes, too. Marichal doesn't make the Hall by our model for the same reason as Catfish Hunter. Morris, on the other hand, only has 5 All-Star Game appearances, which puts him at a disadvantage.
Overall, though, it's fair to say that this model performs very well on the task of classifying pitchers.
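The false-positive and false-negative lists above come from comparing each pitcher's actual Hall of Fame status with the model's prediction. Here is a minimal sketch of that comparison; the function name and the extra placeholder row are illustrative and are not part of the project code.

```python
def split_errors(names, actual, predicted):
    """Return (false_positives, false_negatives) as lists of names."""
    false_pos = [n for n, a, p in zip(names, actual, predicted) if p and not a]
    false_neg = [n for n, a, p in zip(names, actual, predicted) if a and not p]
    return false_pos, false_neg


# Labels for the five misclassified pitchers named in this section, plus one
# correctly classified row (a hypothetical placeholder) to show matches are ignored.
names = ["Mickey Lolich", "Catfish Hunter", "Juan Marichal", "Jack Morris",
         "Hoyt Wilhelm", "Placeholder Pitcher"]
actual = [0, 1, 1, 1, 1, 0]
predicted = [1, 0, 0, 0, 0, 0]

fp, fn = split_errors(names, actual, predicted)
print(fp)  # ['Mickey Lolich']
print(fn)  # ['Catfish Hunter', 'Juan Marichal', 'Jack Morris', 'Hoyt Wilhelm']
```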
Modern Players
Just like last time, we can now turn our attention to players that are currently eligible for induction into the Hall of Fame. Here are our predicted future inductees:
| Name | W | L | G | GS | CG | SHO | SV | IPouts | H | ER | HR | BB | SO | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP | ERA | WHIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Roger Clemens | 354 | 184 | 709 | 707 | 118 | 46 | 0 | 14750 | 4185 | 1707 | 363 | 1580 | 4672 | 63 | 143 | 159 | 20 | 20240 | 0 | 1885 | 119 | 103 | 329 | 3.125 | 1.173 |
| Clayton Kershaw | 169 | 74 | 347 | 344 | 25 | 15 | 0 | 6824 | 1715 | 617 | 173 | 577 | 2464 | 27 | 88 | 33 | 20 | 8958 | 1 | 672 | 86 | 26 | 156 | 2.441 | 1.008 |
| Craig Kimbrel | 31 | 23 | 565 | 0 | 0 | 0 | 346 | 1660 | 306 | 128 | 44 | 217 | 898 | 5 | 44 | 21 | 0 | 2183 | 466 | 137 | 7 | 4 | 24 | 2.082 | 0.945 |
| Joe Nathan | 64 | 34 | 787 | 29 | 0 | 0 | 377 | 2770 | 690 | 294 | 84 | 344 | 976 | 31 | 47 | 23 | 2 | 3771 | 587 | 317 | 25 | 24 | 49 | 2.866 | 1.120 |
| Jonathan Papelbon | 41 | 36 | 689 | 3 | 0 | 0 | 368 | 2177 | 572 | 197 | 57 | 185 | 808 | 18 | 14 | 34 | 1 | 2938 | 585 | 226 | 21 | 11 | 28 | 2.443 | 1.043 |
| Francisco Rodriguez | 52 | 53 | 948 | 0 | 0 | 0 | 437 | 2928 | 738 | 310 | 98 | 389 | 1142 | 42 | 68 | 17 | 2 | 4011 | 677 | 336 | 24 | 22 | 48 | 2.859 | 1.155 |
| CC Sabathia | 251 | 161 | 561 | 560 | 38 | 12 | 0 | 10732 | 3404 | 1485 | 382 | 1099 | 3093 | 44 | 74 | 123 | 17 | 14989 | 0 | 1623 | 94 | 93 | 316 | 3.736 | 1.259 |
| Chris Sale | 109 | 73 | 312 | 232 | 16 | 3 | 12 | 4889 | 1312 | 548 | 172 | 374 | 2007 | 14 | 38 | 98 | 1 | 6544 | 25 | 590 | 20 | 28 | 109 | 3.026 | 1.035 |
| Max Scherzer | 170 | 89 | 365 | 356 | 10 | 5 | 0 | 6870 | 1883 | 813 | 254 | 618 | 2692 | 19 | 65 | 85 | 7 | 9278 | 2 | 876 | 54 | 47 | 120 | 3.195 | 1.092 |
| Curt Schilling | 216 | 146 | 569 | 436 | 83 | 20 | 22 | 9783 | 2998 | 1253 | 347 | 711 | 3116 | 43 | 72 | 52 | 8 | 13284 | 81 | 1318 | 123 | 76 | 193 | 3.458 | 1.137 |
| Billy Wagner | 47 | 40 | 853 | 0 | 0 | 0 | 422 | 2709 | 601 | 232 | 82 | 300 | 1196 | 26 | 43 | 33 | 1 | 3600 | 703 | 262 | 33 | 14 | 51 | 2.312 | 0.998 |
Let's take a look at their careers:
Roger Clemens: He was fourth on the ballot with 61% of the vote in 2020. Clemens is one of the greatest pitchers to step on the mound, winning 7 Cy Young Awards (most all-time) and 354 games. He was associated with the Mitchell Report, though, which has prevented his induction so far. He'll need another 14% in the next two years to make the Hall before he loses eligibility after the 2022 ballot.
Clayton Kershaw: Even though Kershaw is still pitching, our model predicts that if he retired today he would still make the Hall. Kershaw was arguably the most dominant pitcher of the 2010s, and he has the stats to back it up: a career 2.43 ERA, over 2,500 Ks, and a career WHIP of 1.00. He should be a first-ballot Hall-of-Famer five years after he decides to hang up his cleats.
Craig Kimbrel: Kimbrel is another active player, and unlike many of the others on this list he has pitched exclusively as a reliever. Our model predicts that he'll make the Hall based on his ASG appearances and games pitched. One of the most dominant relievers of the 2010s, he won't appear on the ballot until at least 2025.
Joe Nathan: One of the more controversial picks on this list, Nathan was a six-time All-Star who played most of his career with the Twins. He put up solid numbers over 787 appearances, with an ERA of 2.87 and a WHIP of 1.12. However, he was overshadowed in his career by other big-name relievers like Mariano Rivera and Jonathan Papelbon. Nathan will first appear on the ballot in 2022.
Jonathan Papelbon: Papelbon posted an ERA of 2.44 and a WHIP of 1.04 over the course of a 12-year career with the Red Sox, Phillies, and Nationals. He was a key member of the 2007 World Series champion Red Sox team as well. Papelbon was one of the best relievers in the Majors from 2005 to 2015, and our model predicts he will make Cooperstown based on his ASG appearances and number of relief appearances. He will be eligible in 2022.
Francisco Rodríguez: Rodríguez is #4 on the all-time saves leaderboard and made 6 All-Star Games over the course of his 16-year career. Our model classifies him as a Hall-of-Famer because he falls into node #8. He will appear on the ballot for the first time in 2023.
CC Sabathia: Sabathia won a Cy Young Award, made 7 ASGs, and struck out over 3,000 batters over the course of his career. He has a very solid case for Cooperstown, but the voters will ultimately decide starting in 2025.
Chris Sale: Sale was sidelined for all of the 2020 season after undergoing Tommy John surgery in March 2020. He has an impressive seven-year peak, but he may need to accumulate some more counting stats before he's a lock (2,007 Ks as of now). We'll see how he returns in 2021.
Max Scherzer: 'Mad Max' was right alongside Kershaw dominating the Majors in the 2010s. He's struck out over 2,700 batters and won 3 Cy Young Awards. He has a very strong case to make the Hall after retirement.
Curt Schilling: Based on numbers alone, Schilling has a very strong case for Cooperstown. He won 3 World Series titles, struck out over 3,100 batters, and made 6 All-Star Games. He also pitched the 'bloody sock' game while leading the 2004 Red Sox in their curse-breaking World Series run. However, he has been the center of controversy for his right-wing political beliefs, which he frequently expresses on Twitter. He was fired from ESPN in 2016 for statements surrounding the North Carolina bathroom law. In his eighth year on the ballot in 2020, he received 70% of the vote.
Billy Wagner: Wagner played 16 seasons, mostly with the Astros, Phillies, and Mets. He was one of the game's best relief pitchers over that span - in fact, he has a lower career WHIP than unanimous 2019 inductee Mariano Rivera. He was a seven-time All-Star and finished fourth in the Cy Young Award voting in 1999. Wagner struggled on the ballot during his first four years, never receiving more than 20% of the vote, but he surged to 31.7% in 2020. He has five more years of eligibility.
Just like last time, I'll make a few predictions.
- Locks: Kershaw, Scherzer, and CC have solid cases and will have no problem getting in. CC has over 3,000 Ks, Scherzer has over 2,700 Ks and 3 Cy Young Awards, and Kershaw was the most dominant pitcher of a generation.
- Likely: I think Clemens will gain the votes he needs in the next two years, especially because he never officially tested positive for PEDs. Papelbon was one of the game's best for much of his career, and I think his dominant stretch combined with his stats will put him in the Hall. I also think Schilling will make it in, especially because he only needs another 5% of the vote.
- Outsiders: The first 10 years of Kimbrel's career put him firmly on a HOF trajectory. His 2019 and 2020 campaigns, though, are moving his career numbers in the wrong direction. If he can regain his old form and put together 4-5 more solid years, then I think he will have a good chance. While I think Joe Nathan deserves a plaque in Cooperstown, I'm not sure the voters will agree. I'd give him a 50/50 chance. I wouldn't put any money on Rodríguez, either - although he was good, I think he belongs in the Hall of Very Good. Chris Sale is on track, but he needs a few more dominant seasons before he'll be seriously considered for the Hall. As with many of the relievers here, I don't think Billy Wagner will make it, either.
Conclusion
Like last week, I think this model does a decent job at matching human intuition. All of the players here have a Hall of Fame case, even if some of them are less likely than others to make Cooperstown. For the players I disagree with the model on (the 'Outsiders'), it usually comes down to intangibles (like player popularity) or toss-ups. It didn't make any egregious errors on existing players, either.
However, there are some players that it missed. Justin Verlander could probably retire today and make the Hall of Fame. Our model thinks he's accumulated too many losses (129) in his starts (454). Verlander's teammate Zack Greinke has a reasonable chance, too. Maybe we should investigate the model for bias against the Astros.
Lastly, we could also consider breaking this into two parts - one for starting pitchers and one for relief pitchers. I think a lot of the inaccuracies are a result of there being two separate classes within the broader category of "pitcher". With that being said, the results are already pretty good, and it's impressive that a single model is able to capture the HOF standards of both starters and relievers.
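One simple way to make that split, sketched here under the assumption that a pitcher's role can be read off the share of appearances that were starts. The 0.5 cutoff is arbitrary and the function is hypothetical; the G and GS figures are copied from the tables in this article.

```python
def role(games: int, games_started: int, starter_share: float = 0.5) -> str:
    """Label a pitcher by the share of appearances that were starts.

    The 0.5 cutoff is an arbitrary assumption for illustration.
    """
    return "starter" if games_started / games >= starter_share else "reliever"


# (G, GS) pairs copied from the tables in this article.
careers = {
    "CC Sabathia": (561, 560),
    "Hoyt Wilhelm": (1070, 52),
    "Craig Kimbrel": (565, 0),
    "Curt Schilling": (569, 436),
}
roles = {name: role(g, gs) for name, (g, gs) in careers.items()}
print(roles)
```

Each group could then get its own tree, so that a reliever like Wilhelm is compared against other relievers rather than against starters' counting stats.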
As a reminder, check out the codebase for this project here. Stay tuned for new projects soon, too!
About
In this article, we explore the use of decision trees to develop a classifier for MLB Hall of Fame pitchers and use that model to make predictions on active players.