Using Decision Trees to Predict Hall-of-Famers: Part Two

What makes a player great? Predicting Hall-of-Fame pitchers using machine learning.

Mariano Rivera exiting his final game on September 26th, 2013
Photo: Arturo Pardavila III / Flickr

In our previous article, we used decision trees to predict which eligible players will be inducted into the Hall of Fame in the future based on their batting statistics. This week, we'll do the same thing for pitchers. If you haven't read last week's article, do that first - it has all of the background information needed for this project.

The Problem

Using all of the data available about historical players, can we predict which eligible players will be inducted into the Hall of Fame in the future? In our list of eligible players, we'll include both players that have played within the past five years (and so do not meet the ballot's criteria) and players that have appeared on the ballot but haven't reached the 75% threshold for induction.

Pitching and batting have an interesting relationship in baseball. Except for the occasional Bartolo Colón home run or inning pitched by a position player, the two don't often overlap. To highlight the flexibility of these machine learning techniques, though, we can use decision trees to predict Hall-of-Famers on this dataset, too. In fact, we'll barely have to change any code.

The Data

For the most part, we can treat pitching data like we did our batting data. The Chadwick Bureau Database has all of the stats we'll need, including wins, losses, innings pitched, and earned runs for every pitcher that's stepped on the mound. We'll augment this data by calculating the pitchers' ERA and WHIP. We'll also include some informative stats like All-Star Game appearances, World Series MVP awards, and Cy Young Awards. There are two other important notes. First, we only include players who retired after 1970. This ensures that we're basing our predictions on the modern game, which is significantly different from early baseball. Second, we only consider players who made the majority of their appearances at the pitcher position. That keeps HOF-level batters who happened to make a pitching appearance out of our dataset.
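As a rough sketch of the two derived stats, here is how ERA and WHIP can be computed from the raw columns. It assumes innings are stored as outs recorded (the IPouts column that appears in the tables later in this article), so nine innings is 27 outs and one inning is 3 outs. The function names are made up for illustration and are not taken from the project's codebase.

```python
def era(earned_runs: int, ip_outs: int) -> float:
    """Earned run average: earned runs allowed per nine innings (27 outs)."""
    return 27 * earned_runs / ip_outs


def whip(walks: int, hits: int, ip_outs: int) -> float:
    """Walks plus hits per inning pitched (one inning is 3 outs)."""
    return 3 * (walks + hits) / ip_outs


# Mickey Lolich's career totals, copied from the false-positive table
# later in this article.
lolich = {"ER": 1390, "BB": 1099, "H": 3366, "IPouts": 10915}

lolich_era = era(lolich["ER"], lolich["IPouts"])
lolich_whip = whip(lolich["BB"], lolich["H"], lolich["IPouts"])

print(f"ERA:  {lolich_era:.3f}")
print(f"WHIP: {lolich_whip:.3f}")
```

Both values round to the ERA (3.438) and WHIP (1.227) listed for Lolich in the table, which suggests the columns were derived this way.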

Last time, we saw how we can use AUC-ROC to better evaluate our model given the imbalanced classes. Because a similarly small fraction of pitchers makes the Hall of Fame, we'll use the same metric again.
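To see why accuracy alone is misleading here, consider a minimal sketch with made-up data: 2 Hall-of-Famers among 20 pitchers. The `auc_roc` helper below is a plain-Python illustration of the metric (the chance that a randomly chosen Hall-of-Famer gets a higher score than a randomly chosen non-Hall-of-Famer, with ties counted as half), not the function used in the project.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2 Hall-of-Famers followed by 18 other pitchers.
labels = [1, 1] + [0] * 18

# A "model" that gives everyone the same score and never predicts induction.
always_no_scores = [0.0] * 20
always_no_accuracy = sum(1 for y in labels if y == 0) / len(labels)
always_no_auc = auc_roc(labels, always_no_scores)

# A model that ranks one Hall-of-Famer below a single non-Hall-of-Famer.
better_scores = [0.9, 0.4] + [0.5] + [0.1] * 17
better_auc = auc_roc(labels, better_scores)

print(f"always-no accuracy: {always_no_accuracy:.2f}")
print(f"always-no AUC-ROC:  {always_no_auc:.2f}")
print(f"better AUC-ROC:     {better_auc:.4f}")
```

The do-nothing model is 90% accurate but scores only 0.5 on AUC-ROC, while the second model is rewarded for ranking almost every Hall-of-Famer above the rest.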

Lastly, we have the same regularization techniques in place to prevent overfitting. Our minimum impurity decrease is the same, as is our maximum depth.
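For readers who want a feel for what a minimum impurity decrease does, here is a small sketch. It assumes entropy as the impurity measure and uses the weighted decrease formula documented for scikit-learn's `min_impurity_decrease` parameter. The class counts and the 0.01 threshold are made-up placeholders, not the values used in this project.

```python
import math


def entropy(n_neg: int, n_pos: int) -> float:
    """Binary entropy (in bits) of a node holding the given class counts."""
    total = n_neg + n_pos
    result = 0.0
    for count in (n_neg, n_pos):
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of a split, weighted by the parent's share of all samples.

    Each of parent, left, and right is a (non-HOF count, HOF count) tuple.
    """
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    return (n_parent / n_total) * (
        entropy(*parent)
        - (n_left / n_parent) * entropy(*left)
        - (n_right / n_parent) * entropy(*right)
    )


MIN_IMPURITY_DECREASE = 0.01  # placeholder threshold, not the article's value

# Hypothetical root node: 90 non-HOF pitchers and 10 HOF pitchers.
useful_split = weighted_impurity_decrease((90, 10), (88, 2), (2, 8), n_total=100)
useless_split = weighted_impurity_decrease((90, 10), (45, 5), (45, 5), n_total=100)

for name, decrease in [("useful", useful_split), ("useless", useless_split)]:
    kept = decrease >= MIN_IMPURITY_DECREASE
    print(f"{name} split: decrease={decrease:.4f}, kept={kept}")
```

A split that separates most Hall-of-Famers from the rest clears the threshold, while one that leaves both children with the same class mix is rejected. That rejection is what stops the tree from growing branches that only fit noise.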

Results

First, check out the code used in this article here.

This decision tree looks pretty similar to last week's tree. Interestingly, the root node here splits on All-Star Game appearances, just like last week's root node. Here, the threshold is less than or equal to 5.5 appearances.

If a player has more than 5.5 All-Star Game appearances, they are sorted into the right side of the tree. Already, the majority of these players are Hall-of-Famers, as can be seen from the 'value' field and blue color. The model separates the remaining players by games played (node #4) and losses (node #5). This gives us an idea of what the model is looking for: if a player has pitched a large number of games and compiled enough All-Star appearances, then they're pretty much a lock for the Hall of Fame. If they don't have quite as many appearances, then they need a high winning percentage (and a small number of losses).

On the left side of the tree, the model looks at the total number of career strikeouts. 2,793 strikeouts are required to be classified as a Hall-of-Famer without at least 6 All-Star Game appearances. This threshold is close to one of the major achievements in baseball: 3,000 career strikeouts. HOF voters recognize 3,000 strikeouts as a major career accomplishment, so it's possible our model is picking up on that voting tendency.
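The splits described so far can be written as a short rule. This sketch covers only what the article states: the root split at 5.5 All-Star Game appearances and the 2,793-strikeout requirement on the left branch. The right branch splits on games played and losses, but those thresholds aren't given here, so the function returns `None` rather than guessing. Treating the strikeout requirement as "at least 2,793" is an assumption about the exact boundary.

```python
from typing import Optional

ASG_THRESHOLD = 5.5   # root split reported in the article
SO_THRESHOLD = 2793   # strikeouts required on the left branch (boundary assumed)


def left_branch_prediction(all_star_games: int, strikeouts: int) -> Optional[bool]:
    """Return the left-branch prediction, or None when the right branch applies."""
    if all_star_games > ASG_THRESHOLD:
        # Right branch: splits on games played and losses, thresholds not stated.
        return None
    return strikeouts >= SO_THRESHOLD


# Jack Morris: 5 All-Star Games and 2,478 strikeouts (both from this article).
print(left_branch_prediction(5, 2478))

# Mickey Lolich: 2,832 strikeouts. His All-Star count isn't stated in the
# article; 3 is used here, and any value of 5 or fewer gives the same result.
print(left_branch_prediction(3, 2832))

# Catfish Hunter: 8 All-Star Games, so the right branch decides his case.
print(left_branch_prediction(8, 2012))
```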

The model achieves an AUC-ROC score of 98.02%, and it makes sense on an intuitive level, too. Let's dig into it a little further, though, and see where it makes mistakes. First, we can look at false positives - players that our model predicted would be in the Hall of Fame, but aren't:

Name W L G GS CG SHO SV IPouts H ER HR BB SO IBB WP HBP BK BFP GF R SH SF GIDP ERA WHIP
Mickey Lolich 217 191 586 496 195 41 11 10915 3366 1390 347 1099 2832 67 124 92 8 15140 40 1537 108 59 90 3.438 1.227

Only one player! That's pretty good. Mickey Lolich has a borderline case for the Hall of Fame. He put up 6 straight seasons of 200+ strikeouts from 1969 to 1974, including an impressive 308 K campaign in 1971. He even won a World Series MVP award in 1968, when he powered the Tigers in their comeback from a 3-1 deficit against the St. Louis Cardinals. He didn't amass many wins, though, going 217-191 in his career. Our model thinks he should be a Hall-of-Famer based on his career strikeout numbers (2,832, node #3), and I think this is a very reasonable mistake to make.

Next, let's look at false negatives:

Name W L G GS CG SHO SV IPouts H ER HR BB SO IBB WP HBP BK BFP GF R SH SF GIDP ERA WHIP
Catfish Hunter 224 166 500 476 181 42 1 10348 2958 1248 374 954 2012 57 49 49 7 14032 6 1380 86 73 77 3.256 1.134
Juan Marichal 243 142 471 457 244 52 2 10522 3153 1126 320 709 2303 82 51 40 20 14236 11 1329 44 26 15 2.889 1.101
Jack Morris 254 186 549 527 175 28 0 11472 3567 1657 389 1390 2478 99 206 58 27 16120 10 1815 113 114 299 3.900 1.296
Hoyt Wilhelm 143 122 1070 52 20 5 227 6763 1757 632 150 778 1610 61 90 62 4 9164 651 773 4 7 0 2.523 1.125

Hoyt Wilhelm is an extremely interesting player. Before he made his Major League debut in 1952 at the age of 29, he served in World War II. Wilhelm was awarded a Purple Heart for injuries he sustained from a German artillery blast during the Battle of the Bulge. In his baseball career, Wilhelm pitched mostly out of the bullpen, throwing the knuckleball that he would become famous for. He made occasional starts throughout his career, and he even threw a no-hitter in 1958. He pitched until he was 49 in 1972, a feat unthinkable by today's standards. Because our model is trained primarily on starting pitchers, Wilhelm is at a disadvantage - he simply doesn't have the counting stats that starters do. Given his longevity and success as a reliever, he deserves his place in the Hall of Fame.

The other three players all have pretty solid Hall of Fame cases, so I'm inclined to blame the mistakes on the limitations of the model (like our depth limit and the minimum entropy decrease required for a split) rather than the HOF voters. Catfish Hunter won 5 World Series titles and a Cy Young Award, appeared in 8 All-Star Games, and pitched a perfect game over the course of his relatively short career. Our model doesn't think he pitched enough games or has a high enough winning percentage to make the Hall, but I think that's partially because of his early retirement. He's a surefire HOFer. Juan Marichal and Jack Morris each have impressive resumes, too. Marichal doesn't make the Hall by our model for the same reason as Catfish Hunter. Morris, on the other hand, only has 5 All-Star Game appearances, which puts him at a disadvantage.

Overall, though, it's fair to say that this model performs very well on the task of classifying pitchers.
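The false positive and false negative lists above come from comparing each pitcher's actual Hall of Fame status with the model's prediction. Here is a minimal sketch of that bookkeeping, using the five misclassified pitchers named in this article plus two placeholder rows that stand in for correctly classified players. The `split_errors` helper is hypothetical, not the project's code.

```python
# Each row: (name, actually in the Hall of Fame, predicted by the model).
rows = [
    ("Mickey Lolich", False, True),
    ("Catfish Hunter", True, False),
    ("Juan Marichal", True, False),
    ("Jack Morris", True, False),
    ("Hoyt Wilhelm", True, False),
    ("Example Pitcher A", True, True),    # placeholder: correct positive
    ("Example Pitcher B", False, False),  # placeholder: correct negative
]


def split_errors(rows):
    """Return (false_positives, false_negatives) as lists of names."""
    false_positives = [name for name, actual, pred in rows if pred and not actual]
    false_negatives = [name for name, actual, pred in rows if actual and not pred]
    return false_positives, false_negatives


false_positives, false_negatives = split_errors(rows)
print("False positives:", false_positives)
print("False negatives:", false_negatives)
```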

Modern Players

Just like last time, we can now turn our attention to players that are currently eligible for induction into the Hall of Fame. Here are our predicted future inductees:

Name W L G GS CG SHO SV IPouts H ER HR BB SO IBB WP HBP BK BFP GF R SH SF GIDP ERA WHIP
Roger Clemens 354 184 709 707 118 46 0 14750 4185 1707 363 1580 4672 63 143 159 20 20240 0 1885 119 103 329 3.125 1.173
Clayton Kershaw 169 74 347 344 25 15 0 6824 1715 617 173 577 2464 27 88 33 20 8958 1 672 86 26 156 2.441 1.008
Craig Kimbrel 31 23 565 0 0 0 346 1660 306 128 44 217 898 5 44 21 0 2183 466 137 7 4 24 2.082 0.945
Joe Nathan 64 34 787 29 0 0 377 2770 690 294 84 344 976 31 47 23 2 3771 587 317 25 24 49 2.866 1.120
Jonathan Papelbon 41 36 689 3 0 0 368 2177 572 197 57 185 808 18 14 34 1 2938 585 226 21 11 28 2.443 1.043
Francisco Rodriguez 52 53 948 0 0 0 437 2928 738 310 98 389 1142 42 68 17 2 4011 677 336 24 22 48 2.859 1.155
CC Sabathia 251 161 561 560 38 12 0 10732 3404 1485 382 1099 3093 44 74 123 17 14989 0 1623 94 93 316 3.736 1.259
Chris Sale 109 73 312 232 16 3 12 4889 1312 548 172 374 2007 14 38 98 1 6544 25 590 20 28 109 3.026 1.035
Max Scherzer 170 89 365 356 10 5 0 6870 1883 813 254 618 2692 19 65 85 7 9278 2 876 54 47 120 3.195 1.092
Curt Schilling 216 146 569 436 83 20 22 9783 2998 1253 347 711 3116 43 72 52 8 13284 81 1318 123 76 193 3.458 1.137
Billy Wagner 47 40 853 0 0 0 422 2709 601 232 82 300 1196 26 43 33 1 3600 703 262 33 14 51 2.312 0.998

Let's take a look at their careers:

  • Roger Clemens: He was fourth on the ballot with 61% of the vote in 2020. Clemens is one of the greatest pitchers to step on the mound, winning 7 Cy Young Awards (most all-time) and 354 games. He was associated with the Mitchell Report, though, which has prevented his induction so far. He'll need another 14% in the next two years to make the Hall before he loses eligibility after the 2022 ballot.

  • Clayton Kershaw: Even though Kershaw is still pitching, our model predicts that if he retired today he would still make the Hall. Kershaw was arguably the most dominant pitcher of the 2010s, and he has the stats to back it up: a career 2.43 ERA, over 2500 Ks, and a career WHIP of 1.00. He should be a first-ballot Hall of Famer five years after he decides to hang up his cleats.

  • Craig Kimbrel: Kimbrel is another active player, and unlike many of the others on this list he has pitched exclusively as a reliever. Our model predicts that he'll make the Hall based on his ASG appearances and games pitched. One of the most dominant relievers of the 2010s, he won't appear on the ballot until at least 2025.

  • Joe Nathan: One of the more controversial picks on this list, Nathan was a six-time All Star who played most of his career with the Twins. He put up solid numbers over 787 appearances, with an ERA of 2.87 and a WHIP of 1.12. However, he was overshadowed in his career by other big-name relievers like Mariano Rivera and Jonathan Papelbon. Nathan will first appear on the ballot in 2022.

  • Jonathan Papelbon: Papelbon posted an ERA of 2.44 and a WHIP of 1.04 over the course of a 12-year career with the Red Sox, Phillies, and Nationals. He was a key member of the 2007 World Series champion Red Sox team as well. Papelbon was one of the best relievers in the Majors from 2005 to 2015, and our model predicts he will make Cooperstown based on his ASG appearances and number of relief appearances. He will be eligible in 2022.

  • Francisco Rodríguez: Rodríguez is #4 on the all-time saves leaderboard and made 6 All-Star Games over the course of his 16-year career. Our model classifies him as a Hall-of-Famer because he falls into node #8. He will appear on the ballot for the first time in 2023.

  • CC Sabathia: Sabathia won a Cy Young Award, made 7 ASGs, and struck out over 3,000 batters over the course of his career. He has a very solid case for Cooperstown, but the voters will ultimately decide starting in 2025.

  • Chris Sale: Sale was sidelined for all of the 2020 season after undergoing Tommy John surgery in March 2020. He has an impressive seven-year peak, but he may need to accumulate some more counting stats before he's a lock (2,007 Ks as of now). We'll see how he returns in 2021.

  • Max Scherzer: 'Mad Max' was right alongside Kershaw dominating the Majors in the 2010s. He's struck out over 2,700 batters and won 3 Cy Young Awards. He has a very strong case to make the Hall after retirement.

  • Curt Schilling: Based on numbers alone, Schilling has a very strong case for Cooperstown. He won 3 World Series titles, struck out over 3,100 batters, and made 6 All-Star Games. He also pitched the 'bloody sock' game while leading the 2004 Red Sox in their curse-breaking World Series run. However, he has been the center of controversy for his right-wing political beliefs, which he frequently expresses on Twitter. He was fired from ESPN in 2016 for statements surrounding the North Carolina bathroom law. In his eighth year on the ballot in 2020, he received 70% of the vote.

  • Billy Wagner: Wagner played 16 seasons, mostly with the Astros, Phillies, and Mets. He was one of the game's best relief pitchers over that span - in fact, he has a lower career WHIP than unanimous 2019 inductee Mariano Rivera. He was a seven-time All Star and was fourth in the Cy Young Award running in 1999. Wagner struggled on the ballot during his first four years, never receiving above 20% of the vote, but he surged to 31.7% in 2020. He has five more years of eligibility.
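For the three players already on the ballot, the distance to induction is simple arithmetic against the 75% threshold. A quick sketch using the 2020 vote shares quoted in the list above (the variable and function names are just for illustration):

```python
INDUCTION_THRESHOLD = 75.0  # percent of the vote required for induction

# 2020 vote shares as quoted in this article.
vote_share_2020 = {
    "Roger Clemens": 61.0,
    "Curt Schilling": 70.0,
    "Billy Wagner": 31.7,
}


def points_needed(share: float) -> float:
    """Percentage points still needed to reach the induction threshold."""
    return round(max(0.0, INDUCTION_THRESHOLD - share), 1)


gaps = {name: points_needed(share) for name, share in vote_share_2020.items()}

for name, gap in gaps.items():
    print(f"{name}: needs {gap} more percentage points")
```

Schilling is the closest of the three, and Wagner would need to more than double his 2020 share.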

Just like last time, I'll make a few predictions.

  • Locks: Kershaw, Scherzer, and CC have solid cases and will have no problem getting in. CC has over 3,000 Ks, Scherzer has over 2,700 Ks and 3 Cy Young Awards, and Kershaw was the most dominant pitcher of a generation.
  • Likely: I think Clemens will gain the votes he needs in the next two years, especially because he never officially tested positive for PEDs. Papelbon was one of the game's best for much of his career, and I think his dominant stretch combined with his stats will put him in the Hall. I also think Schilling will make it in, especially because he only needs another 5% of the vote.
  • Outsiders: The first 10 years of Kimbrel's career put him firmly on a HOF trajectory. His 2019 and 2020 campaigns, though, are moving his career numbers in the wrong direction. If he can regain his old form and put together 4-5 more solid years, then I think he will have a good chance. While I think Joe Nathan deserves a plaque in Cooperstown, I'm not sure the voters will agree. I'd give him a 50/50 chance. I wouldn't put any money on Rodríguez, either - although he was good, I think he belongs in the Hall of Very Good. Chris Sale is on track, but he needs a few more dominant seasons before he'll be seriously considered for the Hall. I don't think Billy Wagner will make it, either, which is true of many of the relievers here.

Conclusion

Like last week, I think this model does a decent job of matching human intuition. All of the players here have a Hall of Fame case, even if some of them are less likely than others to make Cooperstown. For the players I disagree with the model on (the 'Outsiders'), it usually comes down to intangibles (like player popularity) or toss-ups. It didn't make any egregious errors on existing players, either.

However, there are some players that it missed. Justin Verlander could probably retire today and make the Hall of Fame. Our model thinks he's accumulated too many losses (129) in his starts (454). Verlander's teammate Zack Greinke has a reasonable chance, too. Maybe we should investigate the model for bias against the Astros.

Lastly, we could also consider breaking this into two parts - one for starting pitchers and one for relief pitchers. I think a lot of the inaccuracies are a result of there being two separate classes within the broader category of "pitcher". With that being said, the results are already pretty good, and it's impressive that a single model is able to capture the HOF standards of both starters and relievers.
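One simple way to make that split would be to label each pitcher by the share of his games that were starts. The sketch below uses the G and GS columns from the tables in this article. The 50% cutoff and the `role` helper are assumptions for illustration, not something the current model does.

```python
START_SHARE_CUTOFF = 0.5  # assumed cutoff: at least half of games were starts

# (games, games started), copied from the tables in this article.
pitchers = {
    "Roger Clemens": (709, 707),
    "Curt Schilling": (569, 436),
    "Hoyt Wilhelm": (1070, 52),
    "Craig Kimbrel": (565, 0),
    "Billy Wagner": (853, 0),
}


def role(games: int, games_started: int) -> str:
    """Label a pitcher as a starter or reliever by his share of starts."""
    if games <= 0:
        raise ValueError("games must be positive")
    return "starter" if games_started / games >= START_SHARE_CUTOFF else "reliever"


roles = {name: role(g, gs) for name, (g, gs) in pitchers.items()}

for name, label in roles.items():
    print(f"{name}: {label}")
```

Each group could then get its own tree, so that a reliever like Wilhelm is compared against other relievers rather than against starters' counting stats.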

As a reminder, check out the codebase for this project here. Stay tuned for new projects soon, too!

